Self-supervised video pretraining yields human-aligned visual representations

Nikhil Parthasarathy; S. M. Ali Eslami; João Carreira; Olivier J. Hénaff

Humans learn powerful representations of objects and scenes by observing how they evolve over time. Yet, outside of specific tasks that require explicit temporal understanding, static image pretraining remains the dominant paradigm for learning visual foundation models. We question this mismatch, and ask whether video pretraining can yield visual representations that bear the hallmarks of human perception: generalisation across tasks, robustness to perturbations, and consistency with human judgements. To that end we propose a novel procedure for curating videos, and develop a contrastive framework which learns from the complex transformations therein. This simple paradigm for distilling knowledge from videos, called VITO, yields general representations that far outperform prior video pretraining methods on image understanding tasks, and image pretraining methods on video understanding tasks. Moreover, VITO representations are significantly more robust to natural and synthetic deformations than image-, video-, and adversarially-trained ones. Finally, VITO's predictions are strongly aligned with human judgements, surpassing models that were specifically trained for that purpose. Together, these results suggest that video pretraining could be a simple way of learning unified, robust, and human-aligned representations of the visual world.

updated: Tue Jul 25 2023 16:43:33 GMT+0000 (UTC)

published: Wed Oct 12 2022 17:30:12 GMT+0000 (UTC)

arXiv
