Does Visual Pretraining Help End-to-End Reasoning?

Chen Sun; Calvin Luo; Xingyi Zhou; Anurag Arnab; Cordelia Schmid

We aim to investigate whether end-to-end learning of visual reasoning can be achieved with general-purpose neural networks, with the help of visual pretraining. A positive result would refute the common belief that explicit visual abstraction (e.g. object detection) is essential for compositional generalization on visual reasoning, and confirm the feasibility of a neural network "generalist" to solve visual recognition and reasoning tasks. We propose a simple and general self-supervised framework which "compresses" each video frame into a small set of tokens with a transformer network, and reconstructs the remaining frames based on the compressed temporal context. To minimize the reconstruction loss, the network must learn a compact representation for each image, as well as capture temporal dynamics and object permanence from temporal context. We perform evaluation on two visual reasoning benchmarks, CATER and ACRE. We observe that pretraining is essential to achieve compositional generalization for end-to-end visual reasoning. Our proposed framework outperforms traditional supervised pretraining, including image classification and explicit object detection, by large margins.
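
The compress-then-reconstruct objective described above can be illustrated with a toy forward pass. This is only a minimal sketch under stated assumptions, not the authors' implementation: a single dot-product attention step with random, untrained weights stands in for the transformer network, and the names `slot_queries`, `query_bank`, and `pretraining_loss` are hypothetical. It shows the data flow only: each context frame is pooled into a few tokens, and the remaining frames are predicted from those tokens alone.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [v / total for v in exps]


def attend(queries, keys_values):
    """Single-head dot-product attention; keys double as values (assumption:
    no learned projections, to keep the sketch short)."""
    scale = math.sqrt(len(queries[0]))
    keys_t = [list(col) for col in zip(*keys_values)]
    scores = matmul(queries, keys_t)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, keys_values)


def random_matrix(rng, rows, cols):
    """Gaussian matrix standing in for learned parameters or patch features."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def pretraining_loss(video, slot_queries, query_bank, masked_frames):
    """Return (mean squared reconstruction error, number of context tokens).

    video:         list of frames, each a list of patch feature vectors
    slot_queries:  k query vectors; each context frame is compressed to k tokens
    query_bank:    per-frame patch queries (hypothetical stand-in for
                   position and time embeddings)
    masked_frames: indices of the frames to reconstruct
    """
    # Step 1: compress every context (unmasked) frame into k tokens.
    context = []
    for t, frame in enumerate(video):
        if t not in masked_frames:
            context.extend(attend(slot_queries, frame))

    # Step 2: predict the patches of each masked frame from the context only.
    squared_error, count = 0.0, 0
    for t in masked_frames:
        predicted = attend(query_bank[t], context)
        for pred, target in zip(predicted, video[t]):
            squared_error += sum((p - q) ** 2 for p, q in zip(pred, target))
            count += len(target)
    return squared_error / count, len(context)


rng = random.Random(0)
num_frames, num_patches, dim, num_slots = 5, 6, 4, 2
video = [random_matrix(rng, num_patches, dim) for _ in range(num_frames)]
slot_queries = random_matrix(rng, num_slots, dim)
query_bank = [random_matrix(rng, num_patches, dim) for _ in range(num_frames)]

loss, n_context = pretraining_loss(video, slot_queries, query_bank, masked_frames=[1, 3])
print("context tokens:", n_context, "reconstruction loss:", loss)
```

Because the masked frames are visible to the model only through the small set of context tokens, lowering this loss during training would require those tokens to carry information about object state and motion, which is the intuition the abstract gives for why the pretraining helps.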

updated: Mon Jul 17 2023 14:08:38 GMT+0000 (UTC)

published: Mon Jul 17 2023 14:08:38 GMT+0000 (UTC)

arXiv
