Learning Viewpoint-Agnostic Visual Representations by Recovering Tokens in 3D Space

Jinghuan Shang; Srijan Das; Michael S. Ryoo


Humans are remarkably flexible in understanding viewpoint changes because the visual cortex supports the perception of 3D structure. In contrast, most computer vision models that learn visual representations from a pool of 2D images often fail to generalize to novel camera viewpoints. Recently, vision architectures have shifted towards visual Transformers, convolution-free architectures that operate on tokens derived from image patches. However, these Transformers do not perform explicit operations to learn viewpoint-agnostic representations for visual understanding. To this end, we propose a 3D Token Representation Layer (3DTRL) that estimates the 3D positional information of the visual tokens and leverages it for learning viewpoint-agnostic representations. The key elements of 3DTRL include a pseudo-depth estimator and a learned camera matrix, both trained in an unsupervised fashion, which impose geometric transformations on the tokens. These enable 3DTRL to recover the 3D positional information of the tokens from 2D patches. In practice, 3DTRL can be easily plugged into a Transformer. Our experiments demonstrate the effectiveness of 3DTRL in many vision tasks, including image classification, multi-view video alignment, and action recognition. Models with 3DTRL outperform their backbone Transformers in all the tasks with minimal added computation. Our code is available at https://github.com/elicassion/3DTRL.
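The geometric idea behind recovering a token's 3D position can be illustrated with a minimal sketch. It assumes a standard pinhole camera model and is not the authors' implementation: the function names, the focal constant, and the two camera poses are hypothetical, and the depth and camera parameters are given exactly here, whereas the abstract describes them as estimated by learned modules. The sketch shows that one scene point lands at different image positions in two views, yet back-projecting with depth and undoing the camera transform yields the same world coordinates in both.

```python
import math

def rotation_y(angle):
    """3x3 rotation matrix about the y axis (angle in radians)."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def world_to_camera(p_world, R, t):
    """Apply the camera extrinsics: p_cam = R @ p_world + t."""
    return tuple(sum(R[i][k] * p_world[k] for k in range(3)) + t[i]
                 for i in range(3))

def project(p_cam, focal=1.0):
    """Pinhole projection: return image-plane (u, v) and the depth z."""
    x, y, z = p_cam
    return (focal * x / z, focal * y / z, z)

def back_project(u, v, depth, focal=1.0):
    """Invert the projection: (u, v) plus depth -> camera-frame 3D point."""
    return (u * depth / focal, v * depth / focal, depth)

def camera_to_world(p_cam, R, t):
    """Invert the extrinsics: p_world = R^T @ (p_cam - t)."""
    d = [p_cam[i] - t[i] for i in range(3)]
    return tuple(sum(R[k][i] * d[k] for k in range(3)) for i in range(3))

# One scene point observed by two hypothetical cameras.
point = (0.3, -0.2, 2.0)
cameras = [
    (rotation_y(0.0), (0.0, 0.0, 0.0)),
    (rotation_y(math.radians(30)), (0.5, 0.0, 1.0)),
]

image_coords = []
recovered = []
for R, t in cameras:
    u, v, depth = project(world_to_camera(point, R, t))
    image_coords.append((u, v))
    recovered.append(camera_to_world(back_project(u, v, depth), R, t))

print(image_coords)  # differs between the two views
print(recovered)     # agrees across views, up to floating-point error
```

In this toy setting the recovery is exact because depth and pose are known. With estimated depth and camera parameters, the recovered positions would only be approximately consistent across views.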

updated: Wed Oct 12 2022 22:00:53 GMT+0000 (UTC)

published: Thu Jun 23 2022 17:59:35 GMT+0000 (UTC)

arXiv

