Adversarial Framework for Unsupervised Learning of Motion Dynamics in Videos

C. Spampinato; S. Palazzo; P. D'Oro; D. Giordano; M. Shah

Understanding human behavior in videos is a complex, still unsolved problem that requires accurately modeling motion at both the local (pixel-wise dense prediction) and global (aggregation of motion cues) levels. Current approaches based on supervised learning require large amounts of annotated data, whose scarcity is one of the main factors limiting the development of general solutions. Unsupervised learning can instead leverage the vast amount of video available on the web and is a promising way to overcome these limitations. In this paper, we propose a GAN-based adversarial framework that learns video representations and dynamics through a self-supervision mechanism in order to perform dense and global prediction in videos. Our approach synthesizes videos by 1) factorizing the process into the generation of static visual content and motion, 2) learning a suitable representation of a motion latent space in order to enforce spatio-temporal coherency of object trajectories, and 3) incorporating motion estimation and pixel-wise dense prediction into the training procedure. Self-supervision is enforced by using the motion masks that the generator produces as a by-product of its generation process to supervise the discriminator network in performing dense prediction. Performance evaluation on standard benchmarks shows that our approach learns both local and global video dynamics in an unsupervised way. The learned representations support the training of video object segmentation methods with substantially fewer (about 50%) annotations, giving performance comparable to the state of the art. Furthermore, the proposed method achieves promising performance in generating realistic videos, outperforming state-of-the-art approaches, especially on motion-related metrics.
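The self-supervision idea in the abstract can be illustrated with a toy sketch. The abstract does not give the exact formulation, so everything here is an assumption made for illustration: the mask-weighted blend of a moving foreground over a static background (a form common in two-stream video GANs), the binary cross-entropy loss, and the hypothetical function names `composite` and `dense_bce`. It only shows how a generator-produced motion mask could double as a free pixel-wise label for the discriminator.

```python
import math

def composite(foreground, background, mask):
    """Assumed blend: mask * foreground + (1 - mask) * background, per pixel."""
    return [m * f + (1.0 - m) * b for f, b, m in zip(foreground, background, mask)]

def dense_bce(predicted, target_mask, eps=1e-7):
    """Mean binary cross-entropy between the discriminator's per-pixel motion
    probabilities and the generator's motion mask (used as the label)."""
    total = 0.0
    for p, t in zip(predicted, target_mask):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(predicted)

# Toy 4-pixel "frame": the first two pixels are moving, the last two are static.
foreground = [0.9, 0.8, 0.7, 0.6]   # hypothetical motion stream output
background = [0.1, 0.1, 0.1, 0.1]   # hypothetical static content stream output
mask = [1.0, 1.0, 0.0, 0.0]         # motion mask emitted by the generator

frame = composite(foreground, background, mask)

# A discriminator that agrees with the mask gets a lower dense-prediction loss.
good = dense_bce([0.9, 0.9, 0.1, 0.1], mask)
bad = dense_bce([0.1, 0.1, 0.9, 0.9], mask)
```

In this sketch no human annotation is involved: the label for the dense prediction task comes entirely from the generator's own mask, which is the sense in which the supervision is "self" supervision.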

updated: Tue Sep 17 2019 20:42:07 GMT+0000 (UTC)

published: Sat Mar 24 2018 11:17:41 GMT+0000 (UTC)

arXiv
