Recent Advances in Vision Transformer: A Survey and Outlook of Recent Work

Khawar Islam

Vision Transformers (ViTs) are becoming a more popular and dominant technique for various vision tasks compared to Convolutional Neural Networks (CNNs). As an in-demand technique in computer vision, ViTs have successfully solved various vision problems while focusing on long-range relationships. In this paper, we begin by introducing the fundamental concepts and background of the self-attention mechanism. Next, we provide a comprehensive overview of recent top-performing ViT methods, describing them in terms of strengths and weaknesses, computational cost, and training and testing datasets. We thoroughly compare the performance of various ViT algorithms and the most representative CNN methods on popular benchmark datasets. Finally, we explore some limitations with insightful observations and provide directions for further research. The project page and the collection of papers are available at https://github.com/khawar512/ViT-Survey

updated: Thu Mar 10 2022 04:46:56 GMT+0000 (UTC)

published: Thu Mar 03 2022 06:17:03 GMT+0000 (UTC)

arXiv
