Transformers in Vision: A Survey

Salman Khan; Muzammal Naseer; Munawar Hayat; Syed Waqas Zamir; Fahad Shahbaz Khan; Mubarak Shah

Astounding results from Transformer models on natural language tasks have prompted the vision community to study their application to computer vision problems. Among their salient benefits, Transformers enable modeling of long-range dependencies between input sequence elements and support parallel processing of sequences, in contrast to recurrent networks such as long short-term memory (LSTM). Unlike convolutional networks, Transformers require minimal inductive biases in their design and are naturally suited as set-functions. Furthermore, the straightforward design of Transformers allows multiple modalities (e.g., images, videos, text, and speech) to be processed with similar processing blocks, and it scales well to very large capacity networks and huge datasets. These strengths have led to exciting progress on a number of vision tasks using Transformer networks. This survey aims to provide a comprehensive overview of Transformer models in the computer vision discipline. We start with an introduction to the fundamental concepts behind the success of Transformers, i.e., self-attention, large-scale pre-training, and bidirectional encoding. We then cover extensive applications of Transformers in vision, including popular recognition tasks (e.g., image classification, object detection, action recognition, and segmentation), generative modeling, multi-modal tasks (e.g., visual question answering, visual reasoning, and visual grounding), video processing (e.g., activity recognition and video forecasting), low-level vision (e.g., image super-resolution, image enhancement, and colorization), and 3D analysis (e.g., point cloud classification and segmentation). We compare the respective advantages and limitations of popular techniques in terms of both architectural design and experimental value. Finally, we provide an analysis of open research directions and possible future work.

updated: Wed Jan 19 2022 05:49:50 GMT+0000 (UTC)

published: Mon Jan 04 2021 18:57:24 GMT+0000 (UTC)

arXiv
