Vision Transformer Architecture Search

Xiu Su; Shan You; Jiyang Xie; Mingkai Zheng; Fei Wang; Chen Qian; Changshui Zhang; Xiaogang Wang; Chang Xu

Recently, transformers have shown clear advantages in solving computer vision tasks by modeling an image as a sequence of manually split patches with a self-attention mechanism. However, current vision transformer (ViT) architectures are simply inherited from natural language processing (NLP) tasks and have not been sufficiently investigated or optimized. In this paper, we take a further step by examining the intrinsic structure of transformers for vision tasks and propose an architecture search method, dubbed ViTAS, to search for the optimal architecture under similar hardware budgets. Concretely, we design a new, effective yet efficient weight-sharing paradigm for ViTs, such that architectures with different token embeddings, sequence sizes, numbers of heads, widths, and depths can be derived from a single super-transformer. Moreover, to cater for the variance of distinct architectures, we introduce private class tokens and self-attention maps in the super-transformer. In addition, to adapt the search to different budgets, we propose to search the sampling probability of the identity operation. Experimental results show that ViTAS attains excellent results compared to existing pure transformer architectures. For example, with a 1.3G FLOPs budget, our searched architecture achieves 74.7% top-1 accuracy on ImageNet, outperforming the current baseline ViT architecture by 2.5%. Code is available at https://github.com/xiusu/ViTAS.
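
To make the two search ingredients named in the abstract more concrete, the following is a minimal sketch, not the authors' implementation. It assumes the simplest form of weight sharing, in which a sub-architecture takes the leading block of a super-layer weight matrix, and it treats the identity probability as a fixed input rather than something that is searched. The abstract does not specify the actual ViTAS sharing scheme. All names here (`sample_architecture`, `slice_shared_weight`, `effective_depth`) are hypothetical.

```python
import random


def sample_architecture(num_layers, width_choices, head_choices, p_identity, rng):
    """Sample one sub-architecture from a hypothetical super-transformer.

    Each layer is replaced by an identity (skipped) with probability
    p_identity, so a larger p_identity yields shallower, cheaper models.
    """
    arch = []
    for _ in range(num_layers):
        if rng.random() < p_identity:
            arch.append(None)  # identity operation: the layer is skipped
        else:
            arch.append({
                "width": rng.choice(width_choices),
                "heads": rng.choice(head_choices),
            })
    return arch


def slice_shared_weight(super_weight, out_dim, in_dim):
    """Return the leading out_dim x in_dim block of a super-layer weight.

    Assumption: sub-networks reuse the first rows and columns of the
    super-layer matrix. This is an illustrative scheme only.
    """
    return [row[:in_dim] for row in super_weight[:out_dim]]


def effective_depth(arch):
    """Count the layers that were not replaced by an identity."""
    return sum(layer is not None for layer in arch)


if __name__ == "__main__":
    rng = random.Random(0)
    # Widths and head counts are chosen so every head count divides every width.
    arch = sample_architecture(12, [192, 384], [3, 6], p_identity=0.25, rng=rng)
    print("effective depth:", effective_depth(arch))
```

In this sketch, raising `p_identity` is the only lever that moves sampled architectures toward a smaller budget, which is one plausible reading of why the abstract proposes searching that probability per budget.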

updated: Fri Jun 25 2021 15:39:08 GMT+0000 (UTC)

published: Fri Jun 25 2021 15:39:08 GMT+0000 (UTC)

arXiv
