Searching for Efficient Multi-Stage Vision Transformers

Yi-Lun Liao; Sertac Karaman; Vivienne Sze

Vision Transformer (ViT) demonstrates that the Transformer, originally developed for natural language processing, can be applied to computer vision tasks and achieve performance comparable to convolutional neural networks (CNNs), which have been studied and adopted in computer vision for years. This naturally raises the question of how the performance of ViT can be advanced with CNN design techniques. To this end, we incorporate two such techniques and present ViT-ResNAS, an efficient multi-stage ViT architecture designed with neural architecture search (NAS). First, we propose residual spatial reduction, which decreases sequence lengths in deeper layers and yields a multi-stage architecture. When reducing lengths, we add skip connections to improve performance and stabilize the training of deeper networks. Second, we propose weight-sharing NAS with multi-architectural sampling. We enlarge a network and use its sub-networks to define a search space. A super-network covering all sub-networks is then trained so that their performance can be evaluated quickly. To train the super-network efficiently, we sample and train multiple sub-networks in a single forward-backward pass. Evolutionary search is then performed to discover high-performance network architectures. Experiments on ImageNet demonstrate that ViT-ResNAS achieves better accuracy-MACs and accuracy-throughput trade-offs than the original DeiT and other strong ViT baselines. Code is available at https://github.com/yilunliao/vit-search.
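
The final step in the abstract, evolutionary search over sub-networks under a compute budget, can be illustrated with a small sketch. This is not the authors' code: the search space (per-stage width and depth), the MACs proxy in `cost`, and the `proxy_score` function are all assumptions made for illustration. In the actual method, each candidate would be scored by evaluating it with weights inherited from the trained super-network; here a toy formula stands in for that evaluation.

```python
import random

# Hypothetical search space (assumption): each stage picks a width and a depth.
WIDTHS = [128, 192, 256, 320]
DEPTHS = [1, 2, 3, 4]
NUM_STAGES = 3


def random_arch(rng):
    """Sample one sub-network as a tuple of (width, depth) per stage."""
    return tuple((rng.choice(WIDTHS), rng.choice(DEPTHS)) for _ in range(NUM_STAGES))


def cost(arch):
    """Crude MACs proxy (assumption): depth * tokens * width^2 per stage,
    with the token count shrinking 4x between stages (multi-stage layout)."""
    tokens, total = 196, 0
    for width, depth in arch:
        total += depth * tokens * width * width
        tokens //= 4
    return total


def proxy_score(arch):
    """Stand-in for super-network validation accuracy (assumption):
    rewards capacity with diminishing returns."""
    return sum((width * depth) ** 0.5 for width, depth in arch)


def mutate(arch, rng, prob=0.3):
    """Resample each width and depth independently with probability `prob`."""
    child = []
    for width, depth in arch:
        if rng.random() < prob:
            width = rng.choice(WIDTHS)
        if rng.random() < prob:
            depth = rng.choice(DEPTHS)
        child.append((width, depth))
    return tuple(child)


def crossover(a, b, rng):
    """Pick each stage's configuration from one of the two parents."""
    return tuple(rng.choice(pair) for pair in zip(a, b))


def evolutionary_search(budget, generations=20, population=32, parents=8, seed=0):
    """Keep the top `parents` candidates each generation and refill the
    population with mutated or crossed-over children that fit the budget."""
    rng = random.Random(seed)
    pop = []
    while len(pop) < population:
        arch = random_arch(rng)
        if cost(arch) <= budget:
            pop.append(arch)
    for _ in range(generations):
        pop.sort(key=proxy_score, reverse=True)
        top = pop[:parents]
        children = []
        while len(children) < population - parents:
            if rng.random() < 0.5:
                child = mutate(rng.choice(top), rng)
            else:
                child = crossover(rng.choice(top), rng.choice(top), rng)
            if cost(child) <= budget:  # discard candidates over the budget
                children.append(child)
        pop = top + children
    return max(pop, key=proxy_score)


if __name__ == "__main__":
    best = evolutionary_search(budget=30_000_000)
    print(best, cost(best), round(proxy_score(best), 2))
```

Because the parents survive unchanged into the next generation, the best score in the population never decreases, and the budget check keeps every candidate within the compute constraint.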

updated: Wed Sep 01 2021 22:37:56 GMT+0000 (UTC)

published: Wed Sep 01 2021 22:37:56 GMT+0000 (UTC)

arXiv
