Chasing Sparsity in Vision Transformers: An End-to-End Exploration

Tianlong Chen; Yu Cheng; Zhe Gan; Lu Yuan; Lei Zhang; Zhangyang Wang

Vision transformers (ViTs) have recently gained explosive popularity, but their enormous model sizes and training costs remain daunting. Conventional post-training pruning often incurs higher training budgets. In contrast, this paper aims to trim down both the training memory overhead and the inference complexity without sacrificing the achievable accuracy. We launch and report the first-of-its-kind comprehensive exploration of a unified approach that integrates sparsity in ViTs "from end to end". Specifically, instead of training full ViTs, we dynamically extract and train sparse subnetworks while sticking to a fixed small parameter budget. Our approach jointly optimizes model parameters and explores connectivity throughout training, ending up with one sparse network as the final output. The approach is seamlessly extended from unstructured to structured sparsity, the latter by guiding the prune-and-grow of self-attention heads inside ViTs. For additional efficiency gains, we further co-explore data and architecture sparsity by plugging in a novel learnable token selector that adaptively determines the currently most vital patches. Extensive results validate the effectiveness of our proposals on ImageNet with diverse ViT backbones. For instance, at 40% structured sparsity, our sparsified DeiT-Base achieves a 0.42% accuracy gain with 33.13% FLOPs and 24.70% running time savings compared to its dense counterpart. Perhaps most surprisingly, we find that the proposed sparse (co-)training can even improve ViT accuracy rather than compromising it, making sparsity a tantalizing "free lunch". For example, our sparsified DeiT-Small at (5%, 50%) sparsity for (data, architecture) improves top-1 accuracy by 0.28% while enjoying 49.32% FLOPs and 4.40% running time savings.
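The idea of exploring connectivity under a fixed parameter budget can be illustrated with a minimal NumPy sketch of one prune-and-grow step on a binary weight mask. This is not the authors' implementation: the abstract does not state the pruning or growth criteria, so the sketch assumes magnitude-based pruning and gradient-magnitude-based growth, and the names `prune_and_grow` and `update_fraction` are hypothetical.

```python
import numpy as np

def prune_and_grow(weights, mask, grads, update_fraction=0.2):
    """One connectivity update that keeps the number of active weights fixed.

    Assumed criteria (not taken from the abstract): prune the active weights
    with the smallest magnitude, grow the inactive positions with the largest
    gradient magnitude.
    """
    active = np.flatnonzero(mask)
    inactive = np.flatnonzero(mask == 0)
    # Number of connections to swap; cannot exceed the free positions.
    k = min(int(update_fraction * active.size), inactive.size)
    if k == 0:
        return mask.copy()

    w = np.abs(weights.ravel())
    g = np.abs(grads.ravel())
    drop = active[np.argsort(w[active])[:k]]       # weakest active weights
    grow = inactive[np.argsort(-g[inactive])[:k]]  # most promising inactive ones

    new_mask = mask.copy().ravel()
    new_mask[drop] = 0
    new_mask[grow] = 1
    return new_mask.reshape(mask.shape)

# Toy usage: a 4x4 layer with 10 of 16 weights active.
mask = np.zeros(16, dtype=np.int8)
mask[:10] = 1
mask = mask.reshape(4, 4)
weights = np.arange(1, 17, dtype=float).reshape(4, 4)
grads = np.arange(16, dtype=float).reshape(4, 4)
new_mask = prune_and_grow(weights, mask, grads)
```

Because the step removes and adds the same number of connections, the parameter budget is unchanged after every update. Under the same assumptions, a structured variant could score whole self-attention heads instead of individual weights.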

updated: Tue Jun 08 2021 17:18:00 GMT+0000 (UTC)

published: Tue Jun 08 2021 17:18:00 GMT+0000 (UTC)

arXiv
