Chasing Sparsity in Vision Transformers: An End-to-End Exploration

Tianlong Chen; Yu Cheng; Zhe Gan; Lu Yuan; Lei Zhang; Zhangyang Wang

Vision transformers (ViTs) have recently gained explosive popularity, but their enormous model sizes and training costs remain daunting. Conventional post-training pruning often incurs higher training budgets. In contrast, this paper aims to trim down both the training memory overhead and the inference complexity without sacrificing the achievable accuracy. We carry out a first-of-its-kind comprehensive exploration of a unified approach that integrates sparsity in ViTs "from end to end". Specifically, instead of training full ViTs, we dynamically extract and train sparse subnetworks while sticking to a fixed small parameter budget. Our approach jointly optimizes model parameters and explores connectivity throughout training, ending up with one sparse network as the final output. The approach extends seamlessly from unstructured to structured sparsity, the latter by guiding the prune-and-grow of self-attention heads inside ViTs. We further co-explore data and architecture sparsity for additional efficiency gains by plugging in a novel learnable token selector that adaptively determines the currently most vital patches. Extensive results on ImageNet with diverse ViT backbones validate the effectiveness of our proposals, which obtain significantly reduced computational cost and almost unimpaired generalization. Perhaps most surprisingly, we find that the proposed sparse (co-)training can sometimes improve ViT accuracy rather than compromise it, making sparsity a tantalizing "free lunch". For example, our sparsified DeiT-Small at (5%, 50%) sparsity for (data, architecture) improves top-1 accuracy by 0.28% while saving 49.32% of FLOPs and 4.40% of running time. Our code is available at https://github.com/VITA-Group/SViTE.
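
To make the idea of exploring connectivity under a fixed parameter budget concrete, the following is a minimal sketch of one prune-and-grow step on a flat weight vector. It is not taken from the SViTE code. The function name `prune_and_grow`, the parameter `update_fraction`, and the two selection criteria (prune the smallest-magnitude active weights, grow the inactive connections with the largest gradient magnitude) are assumptions chosen for illustration; the abstract does not specify the criteria.

```python
def prune_and_grow(weights, grads, mask, update_fraction=0.5):
    """Return a new mask with the same number of active connections.

    Hypothetical helper: the pruning and growth criteria below are
    illustrative assumptions, not the paper's confirmed rules.
    """
    active = [i for i, m in enumerate(mask) if m]
    inactive = [i for i, m in enumerate(mask) if not m]
    # Swap at most as many connections as there are inactive slots.
    k = min(int(update_fraction * len(active)), len(inactive))

    # Prune: deactivate the k active weights with the smallest magnitude.
    pruned = sorted(active, key=lambda i: abs(weights[i]))[:k]
    # Grow: activate the k inactive positions with the largest gradient
    # magnitude. Candidates come from the pre-prune inactive set, so a
    # weight pruned in this step cannot be regrown in the same step.
    grown = sorted(inactive, key=lambda i: -abs(grads[i]))[:k]

    new_mask = list(mask)
    for i in pruned:
        new_mask[i] = False
    for i in grown:
        new_mask[i] = True
    return new_mask


# Toy example: 8 possible connections, 4 active (50% sparsity).
weights = [0.9, -0.05, 0.4, 0.0, 0.0, 0.0, 0.7, 0.0]
grads = [0.1, 0.2, 0.1, 0.8, 0.05, 0.6, 0.3, 0.1]
mask = [True, True, True, False, False, False, True, False]

new_mask = prune_and_grow(weights, grads, mask, update_fraction=0.5)
print(new_mask)
print(sum(mask), sum(new_mask))  # the parameter budget is unchanged
```

Because each step removes and adds the same number of connections, the count of active parameters stays constant, which is what keeps the budget fixed during training in this sketch. A structured variant would apply the same swap to whole attention heads instead of individual weights.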

updated: Fri Oct 22 2021 21:45:38 GMT+0000 (UTC)

published: Tue Jun 08 2021 17:18:00 GMT+0000 (UTC)

arXiv

