Peeling the Onion: Hierarchical Reduction of Data Redundancy for Efficient Vision Transformer Training

Zhenglun Kong; Haoyu Ma; Geng Yuan; Mengshu Sun; Yanyue Xie; Peiyan Dong; Xin Meng; Xuan Shen; Hao Tang; Minghai Qin; Tianlong Chen; Xiaolong Ma; Xiaohui Xie; Zhangyang Wang; Yanzhi Wang

Vision transformers (ViTs) have recently achieved success in many applications, but their intensive computation and heavy memory usage at both training and inference time limit their generalization. Previous compression algorithms usually start from pre-trained dense models and focus only on efficient inference, so time-consuming training remains unavoidable. In contrast, this paper points out that million-scale training data is redundant, which is the fundamental reason for the tedious training. To address the issue, this paper introduces sparsity into the data and proposes an end-to-end efficient training framework built on three sparse perspectives, dubbed Tri-Level E-ViT. Specifically, we leverage a hierarchical data redundancy reduction scheme that explores sparsity at three levels: the number of training examples in the dataset, the number of patches (tokens) in each example, and the number of connections between tokens in the attention weights. With extensive experiments, we demonstrate that our proposed technique can noticeably accelerate training for various ViT architectures while maintaining accuracy. Remarkably, under certain ratios, we are able to improve ViT accuracy rather than compromise it. For example, we achieve a 15.2% speedup with 72.6% (+0.4) Top-1 accuracy on DeiT-T, and a 15.7% speedup with 79.9% (+0.1) Top-1 accuracy on DeiT-S. This demonstrates the existence of data redundancy in ViT training.
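
To make the three levels concrete, the following minimal sketch counts how many token-to-token attention connections remain (per attention head, per layer, per epoch) once examples, tokens, and connections are each thinned. It is a counting illustration only, not the authors' algorithm: the abstract does not say how examples, tokens, or connections are selected, and the names `KeepRatios` and `remaining_connections` as well as every ratio used here are hypothetical. The count is also not a proxy for the reported speedups.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KeepRatios:
    """Hypothetical keep ratios in (0, 1] for the three levels (not taken from the paper)."""
    examples: float     # fraction of training examples kept in the dataset
    tokens: float       # fraction of patches (tokens) kept in each example
    connections: float  # fraction of token-to-token attention connections kept


def remaining_connections(num_examples: int, tokens_per_example: int, keep: KeepRatios) -> int:
    """Count attention connections left after thinning all three levels.

    Dense self-attention links every token to every token, so one example
    with n tokens has n * n connections per head per layer.
    """
    kept_examples = round(num_examples * keep.examples)
    kept_tokens = round(tokens_per_example * keep.tokens)
    dense_pairs = kept_tokens * kept_tokens  # quadratic in the kept tokens
    kept_pairs = round(dense_pairs * keep.connections)
    return kept_examples * kept_pairs


# Illustrative numbers only: 1000 examples of 196 patch tokens each.
dense = remaining_connections(1000, 196, KeepRatios(1.0, 1.0, 1.0))
sparse = remaining_connections(1000, 196, KeepRatios(0.5, 0.5, 0.5))
fraction_left = sparse / dense
```

Because the token level enters quadratically, halving all three ratios in this toy setting leaves 0.5 × 0.5² × 0.5 = 1/16 of the dense connections, which is why the levels compound rather than simply add.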

updated: Sat Nov 19 2022 21:15:47 GMT+0000 (UTC)

published: Sat Nov 19 2022 21:15:47 GMT+0000 (UTC)

arXiv
