Improved Cross-view Completion Pre-training for Stereo Matching and Optical Flow

Philippe Weinzaepfel; Thomas Lucas; Vincent Leroy; Yohann Cabon; Vaibhav Arora; Romain Brégier; Gabriela Csurka; Leonid Antsfeld; Boris Chidlovskii; Jérôme Revaud

Despite impressive performance on high-level downstream tasks, self-supervised pre-training methods have not yet fully delivered on dense geometric vision tasks such as stereo matching or optical flow. The application of self-supervised concepts, such as instance discrimination or masked image modeling, to geometric tasks is an active area of research. In this work, we build on the recent cross-view completion framework, a variation of masked image modeling that leverages a second view of the same scene, which makes it well suited for binocular downstream tasks. The applicability of this concept has so far been limited in at least two ways: (a) by the difficulty of collecting real-world image pairs (in practice only synthetic data have been used) and (b) by the lack of generalization of vanilla transformers to dense downstream tasks for which relative position is more meaningful than absolute position. We explore three avenues of improvement. First, we introduce a method to collect suitable real-world image pairs at large scale. Second, we experiment with relative positional embeddings and show that they enable vision transformers to perform substantially better. Third, we scale up vision-transformer-based cross-completion architectures, which is made possible by the use of large amounts of data. With these improvements, we show for the first time that state-of-the-art results on stereo matching and optical flow can be reached without using any classical task-specific techniques such as correlation volumes, iterative estimation, image warping or multi-scale reasoning, thus paving the way towards universal vision models.
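To make the point about relative versus absolute position concrete, the following minimal sketch uses a rotary-style positional embedding in one dimension. This is an assumption chosen for illustration: the abstract does not state which relative scheme is used, and the function names (`rotate`, `score`), the base of 10000 and the toy vectors are all hypothetical. The sketch checks that the attention score between a query and a key depends only on the offset between their positions, not on where the pair sits in the sequence.

```python
import math

def rotate(vec, pos, base=10000.0):
    """Rotate each consecutive pair of `vec` by an angle proportional to `pos`.

    Illustrative rotary-style embedding (an assumption, not taken from the abstract).
    """
    d = len(vec)
    assert d % 2 == 0, "vector length must be even"
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # one frequency per pair of channels
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def score(q, q_pos, k, k_pos):
    """Dot-product attention score after position-dependent rotation."""
    rq = rotate(q, q_pos)
    rk = rotate(k, k_pos)
    return sum(a * b for a, b in zip(rq, rk))

# Toy query/key vectors (hypothetical values).
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Same offset (+2) at two different absolute locations gives the same score.
shifted_equal = math.isclose(score(q, 3, k, 5), score(q, 40, k, 42), abs_tol=1e-9)

# A different offset (+6) gives a different score.
different_offset = not math.isclose(score(q, 3, k, 5), score(q, 3, k, 9), abs_tol=1e-9)

print(shifted_equal, different_offset)
```

Because the score is unchanged when both positions are shifted by the same amount, a model using such an embedding can in principle treat a given displacement between two patches the same way anywhere in the image, which is the property that matters for dense matching tasks such as stereo and optical flow.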

updated: Thu Mar 16 2023 16:12:56 GMT+0000 (UTC)

published: Fri Nov 18 2022 18:18:53 GMT+0000 (UTC)

arXiv
