Where are my Neighbors? Exploiting Patches Relations in Self-Supervised Vision Transformer

Guglielmo Camporese; Elena Izzo; Lamberto Ballan

Vision Transformers (ViTs) have enabled the use of the transformer architecture on vision tasks, showing impressive performance when trained on large datasets. However, on relatively small datasets, ViTs are less accurate because they lack inductive biases. To address this, we propose a simple yet effective self-supervised learning (SSL) strategy for training ViTs that significantly improves results without any external annotation. Specifically, we define a set of SSL tasks based on relations between image patches, which the model has to solve either before or jointly with the downstream training. Unlike ViT, our RelViT model optimizes all the output tokens of the transformer encoder that correspond to image patches, thus exploiting more training signal at each training step. We evaluate the proposed methods on several image benchmarks and find that RelViT improves on state-of-the-art SSL methods by a large margin, especially on small datasets.
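To make the idea of patch-relation targets concrete, the following is a minimal sketch of how such labels could be built without any annotation. The abstract does not spell out the exact tasks, so the two label schemes here (a pairwise relative-offset class and a per-token original-position target after shuffling) are illustrative assumptions, and all function names are hypothetical rather than taken from the authors' code.

```python
import random


def patch_grid_positions(grid_size):
    """Return the (row, col) position of each patch index, in row-major order."""
    return [(i // grid_size, i % grid_size) for i in range(grid_size * grid_size)]


def relative_offset_labels(grid_size):
    """Assumed pairwise task: class id of the spatial offset from patch i to patch j.

    Offsets along each axis lie in [-(grid_size - 1), grid_size - 1], so there
    are (2 * grid_size - 1) ** 2 possible classes.
    """
    positions = patch_grid_positions(grid_size)
    span = 2 * grid_size - 1
    labels = []
    for (row_i, col_i) in positions:
        row_labels = []
        for (row_j, col_j) in positions:
            d_row = row_j - row_i + grid_size - 1  # shifted to be non-negative
            d_col = col_j - col_i + grid_size - 1
            row_labels.append(d_row * span + d_col)
        labels.append(row_labels)
    return labels


def shuffled_position_targets(grid_size, seed=0):
    """Assumed per-token task: after shuffling the patches, slot k must
    predict the original index of the patch placed there."""
    rng = random.Random(seed)
    permutation = list(range(grid_size * grid_size))
    rng.shuffle(permutation)
    return permutation


labels = relative_offset_labels(3)
targets = shuffled_position_targets(3)
```

In both schemes every patch token receives its own target, which is one way to read the claim that more training signal is available per step than when only a single class token is supervised.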

updated: Wed Jun 01 2022 13:25:32 GMT+0000 (UTC)

published: Wed Jun 01 2022 13:25:32 GMT+0000 (UTC)

arXiv
