Learning Temporal Dynamics from Cycles in Narrated Video

Dave Epstein; Jiajun Wu; Cordelia Schmid; Chen Sun

Learning to model how the world changes as time elapses has proven a challenging problem for the computer vision community. We propose a self-supervised solution to this problem using temporal cycle consistency jointly in vision and language, training on narrated video. Our model learns modality-agnostic functions to predict forward and backward in time, which must undo each other when composed. This constraint leads to the discovery of high-level transitions between moments in time, since such transitions are easily inverted and shared across modalities. We justify the design of our model with an ablation study on different configurations of the cycle consistency problem. We then show qualitatively and quantitatively that our approach yields a meaningful, high-level model of the future and past. We apply the learned dynamics model without further training to various tasks, such as predicting future action and temporally ordering sets of images.
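
To make the cycle constraint concrete, here is a minimal toy sketch in Python. It is not the authors' model: the abstract does not give the architecture, the loss, or the embedding details, so everything below is an assumption made for illustration. The names `rotation`, `forward`, `backward`, and `cycle_loss` are hypothetical, and fixed 2-D rotations stand in for the learned forward and backward predictors. The sketch only shows what "must undo each other when composed" means as a measurable quantity.

```python
import numpy as np


def rotation(theta):
    """Return a 2x2 rotation matrix (a stand-in for a learned predictor)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])


def cycle_loss(x, forward, backward):
    """Mean squared error between x and backward(forward(x))."""
    x_cycled = backward(forward(x))
    return float(np.mean((x_cycled - x) ** 2))


theta = 0.5  # hypothetical "step forward in time" in a toy embedding space

# Assumed toy predictors: forward rotates by theta, backward by -theta.
forward = lambda x: x @ rotation(theta).T
backward = lambda x: x @ rotation(-theta).T

# A backward predictor that ignores the forward step (identity map).
broken_backward = lambda x: x

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 2))  # eight toy "moment" embeddings

loss_consistent = cycle_loss(x, forward, backward)
loss_broken = cycle_loss(x, forward, broken_backward)

print(f"cycle loss, consistent pair: {loss_consistent:.3e}")
print(f"cycle loss, broken pair:     {loss_broken:.3e}")
```

The consistent pair gives a loss at floating-point noise level, while the broken pair gives a clearly positive loss. In a training setup along the lines the abstract describes, a penalty of this kind would be minimized over the parameters of both predictors rather than evaluated on fixed maps.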

updated: Thu Jan 07 2021 02:41:32 GMT+0000 (UTC)

published: Thu Jan 07 2021 02:41:32 GMT+0000 (UTC)

arXiv
