Self-supervised Learning by View Synthesis

Shaoteng Liu; Xiangyu Zhang; Tao Hu; Jiaya Jia

We present view-synthesis autoencoders (VSA), a self-supervised learning framework designed for vision transformers. Unlike traditional 2D pre-training methods, VSA can be pre-trained on multi-view data. In each iteration, the input to VSA is one view (or multiple views) of a 3D object, and the output is a synthesized image in another target pose. The decoder of VSA has several cross-attention blocks, which use the source view as value, the source pose as key, and the target pose as query to synthesize the target view. This simple approach achieves large-angle view synthesis and learns spatially invariant representations, which serve as a good initialization for transformers on downstream tasks such as 3D classification on ModelNet40, ShapeNet Core55, and ScanObjectNN. VSA significantly outperforms existing methods under linear probing and is competitive under fine-tuning. The code will be made publicly available.
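The query/key/value assignment described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single attention head, omits the learned projections, layer normalization, and MLP layers a real transformer decoder block would have, and uses hypothetical names (`cross_attention`, `target_pose_queries`, `source_pose_keys`, `source_view_values`) and toy embeddings chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(target_pose_queries, source_pose_keys, source_view_values):
    """Single-head scaled dot-product cross-attention (illustrative only).

    target_pose_queries: one pose embedding per target token (queries).
    source_pose_keys:    one pose embedding per source token (keys).
    source_view_values:  one appearance feature per source token (values).
    Learned Q/K/V projections are omitted; that is an assumption of this sketch.
    """
    dim = len(source_pose_keys[0])
    scale = 1.0 / math.sqrt(dim)
    value_dim = len(source_view_values[0])
    outputs = []
    for query in target_pose_queries:
        # Compare the target pose with every source pose.
        scores = [
            scale * sum(q * k for q, k in zip(query, key))
            for key in source_pose_keys
        ]
        weights = softmax(scores)
        # Blend source-view features according to pose similarity.
        mixed = [
            sum(w * value[j] for w, value in zip(weights, source_view_values))
            for j in range(value_dim)
        ]
        outputs.append(mixed)
    return outputs


# Toy example: two source tokens, three target tokens (hypothetical data).
source_pose_keys = [[1.0, 0.0], [0.0, 1.0]]
source_view_values = [[10.0, 0.0, 0.0], [0.0, 10.0, 0.0]]
target_pose_queries = [[20.0, 0.0], [0.0, 20.0], [1.0, 1.0]]

synthesized = cross_attention(
    target_pose_queries, source_pose_keys, source_view_values
)
```

In this toy setup, a target pose that closely matches one source pose pulls almost all of its features from that source token, while a target pose equally similar to both source poses receives an even blend of their features.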

updated: Sat Apr 22 2023 06:12:13 GMT+0000 (UTC)

published: Sat Apr 22 2023 06:12:13 GMT+0000 (UTC)

arXiv
