Transformers in Vision: A Survey

Salman Khan; Muzammal Naseer; Munawar Hayat; Syed Waqas Zamir; Fahad Shahbaz Khan; Mubarak Shah

Astounding results from transformer models on natural language tasks have prompted the vision community to study their application to computer vision problems. This has led to exciting progress on a number of tasks while requiring minimal inductive biases in the model design. This survey aims to provide a comprehensive overview of transformer models in the computer vision discipline and assumes little to no prior background in the field. We start with an introduction to the fundamental concepts behind the success of transformer models, i.e., self-supervision and self-attention. Transformer architectures leverage self-attention mechanisms to encode long-range dependencies in the input domain, which makes them highly expressive. Since they assume minimal prior knowledge about the structure of the problem, self-supervision using pretext tasks is applied to pre-train transformer models on large-scale (unlabelled) datasets. The learned representations are then fine-tuned on the downstream tasks, typically leading to excellent performance due to the generalization and expressivity of the encoded features. We cover extensive applications of transformers in vision, including popular recognition tasks (e.g., image classification, object detection, action recognition, and segmentation), generative modeling, multi-modal tasks (e.g., visual question answering and visual reasoning), video processing (e.g., activity recognition and video forecasting), low-level vision (e.g., image super-resolution and colorization), and 3D analysis (e.g., point cloud classification and segmentation). We compare the respective advantages and limitations of popular techniques in terms of both architectural design and experimental value. Finally, we provide an analysis of open research directions and possible future work.
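
To make the self-attention idea in the abstract concrete, the following is a minimal sketch of single-head self-attention in NumPy, assuming the standard scaled dot-product form. The function name `self_attention`, the projection matrices `w_q`, `w_k`, `w_v`, and the toy dimensions are illustrative assumptions and are not taken from the survey. Each output token is a weighted sum over all input tokens, which is how a single layer can relate distant positions.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention (illustrative sketch).

    x:   (n_tokens, d_model) input sequence, e.g. flattened image patches
    w_*: (d_model, d_head) projection matrices (random here, learned in practice)
    """
    q = x @ w_q  # queries
    k = x @ w_k  # keys
    v = x @ w_v  # values

    # Pairwise similarity between every query and every key, scaled by sqrt(d_head).
    scores = q @ k.T / np.sqrt(k.shape[-1])

    # Row-wise softmax (shifted by the row max for numerical stability).
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    # Each output row mixes the values of all tokens, near or far.
    return weights @ v, weights

# Toy example with hypothetical sizes: 6 tokens, model width 8, head width 4.
rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 6, 8, 4
x = rng.normal(size=(n_tokens, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
out, weights = self_attention(x, w_q, w_k, w_v)
```

Here `weights` has one row per token, and each row is a probability distribution over all tokens in the sequence, so no locality prior is built in. That is one way to read the abstract's point about minimal inductive biases.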

updated: Mon Jan 04 2021 18:57:24 GMT+0000 (UTC)

published: Mon Jan 04 2021 18:57:24 GMT+0000 (UTC)

arXiv
