ViTAS: Vision Transformer Architecture Search

Xiu Su; Shan You; Jiyang Xie; Mingkai Zheng; Fei Wang; Chen Qian; Changshui Zhang; Xiaogang Wang; Chang Xu


Vision transformers (ViTs) inherited their success from NLP, but their structures have not been sufficiently investigated or optimized for visual tasks. One of the simplest solutions is to search for the optimal structure directly with neural architecture search (NAS), which is widely used for CNNs. However, we find empirically that this straightforward adaptation encounters catastrophic failures and makes training of the superformer frustratingly unstable. In this paper, we argue that because ViTs mainly operate on token embeddings with little inductive bias, the imbalance of channels across different architectures undermines the weight-sharing assumption and causes the training instability. We therefore develop a new cyclic weight-sharing mechanism for the token embeddings of ViTs, which enables each channel to contribute more evenly to all candidate architectures. In addition, we propose identity shifting to alleviate the many-to-one issue in the superformer, and we leverage weak augmentation and regularization techniques, which empirically yield steadier training. Based on these, our proposed method, ViTAS, achieves significant superiority on both DeiT- and Twins-based ViTs. For example, with a budget of only 1.4G FLOPs, our searched architecture achieves 3.3% higher ImageNet-1k accuracy than the baseline DeiT. With 3.0G FLOPs, our results reach 82.0% accuracy on ImageNet-1k and 45.9% mAP on COCO2017, which is 2.4% higher than other ViTs.
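
The channel-imbalance argument in the abstract can be illustrated with a small counting sketch. This is not the index mapping used by ViTAS, which the abstract does not specify. It assumes a toy supernet with 8 shared channels and four candidate widths, and compares prefix ("ordinal") sharing, where a width-c candidate always takes channels 0..c-1, with a hypothetical cyclic scheme in which each candidate takes a wrapping window that starts where the previous one ended. The function names, the width set, and the offset rule are all assumptions made for illustration.

```python
from collections import Counter


def ordinal_indices(width):
    """Prefix sharing: a candidate of width c uses channels 0..c-1."""
    return list(range(width))


def cyclic_indices(width, total, offset):
    """Hypothetical cyclic sharing: a window of `width` channels
    starting at `offset` and wrapping around modulo `total`."""
    return [(offset + i) % total for i in range(width)]


def usage_counts(total, widths, scheme):
    """Count how many candidate widths touch each shared channel."""
    counts = Counter({ch: 0 for ch in range(total)})
    offset = 0
    for w in widths:
        if scheme == "ordinal":
            idx = ordinal_indices(w)
        else:
            idx = cyclic_indices(w, total, offset)
            # Assumed rule: the next window starts where this one ended.
            offset = (offset + w) % total
        counts.update(idx)
    return [counts[ch] for ch in range(total)]


if __name__ == "__main__":
    widths = [2, 4, 6, 8]
    for scheme in ("ordinal", "cyclic"):
        counts = usage_counts(8, widths, scheme)
        print(scheme, counts, "spread:", max(counts) - min(counts))
```

In this toy setting, prefix sharing uses the first channels in every candidate and the last channels in only one, whereas the cyclic windows spread usage so that no channel is used by more than one candidate beyond any other. That is the sense in which each channel can "contribute more evenly" to the candidates, under the stated assumptions.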

updated: Tue Nov 30 2021 12:33:40 GMT+0000 (UTC)

published: Fri Jun 25 2021 15:39:08 GMT+0000 (UTC)

arXiv

