Zero-TPrune: Zero-Shot Token Pruning through Leveraging of the Attention Graph in Pre-Trained Transformers

Hongjie Wang; Bhishma Dedhia; Niraj K. Jha

Deployment of Transformer models on the edge is increasingly challenging due to the exponentially growing model size and inference cost that scales quadratically with the number of tokens in the input sequence. Token pruning is an emerging solution to address this challenge due to its ease of deployment on various Transformer backbones. However, most token pruning methods require a computationally expensive fine-tuning process after or during pruning, which is not desirable in many cases. Some recent works explore pruning of off-the-shelf pre-trained Transformers without fine-tuning. However, they only take the importance of tokens into consideration. In this work, we propose Zero-TPrune, the first zero-shot method that considers both the importance and similarity of tokens in performing token pruning. Zero-TPrune leverages the attention graph of pre-trained Transformer models to produce an importance rank for tokens and removes the less informative tokens. The attention matrix can be thought of as an adjacency matrix of a directed graph, to which a graph shift operator can be applied iteratively to obtain the importance score distribution. This distribution guides the partition of tokens into two groups and measures similarity between them. Due to the elimination of the fine-tuning overhead, Zero-TPrune can easily prune large models and perform hyperparameter tuning efficiently. We evaluate the performance of Zero-TPrune on vision tasks by applying it to various vision Transformer backbones. Compared with state-of-the-art pruning methods that require fine-tuning, Zero-TPrune not only eliminates the need for fine-tuning after pruning, but does so with only around 0.3% accuracy loss. Compared with state-of-the-art fine-tuning-free pruning methods, Zero-TPrune reduces accuracy loss by up to 45% on medium-sized models.
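
To make the attention-graph idea concrete, the following is a minimal sketch of how an importance score distribution could be obtained by repeatedly applying a graph shift operator to an attention matrix, then used to keep the highest-ranked tokens. It is an illustration under stated assumptions, not the authors' implementation: it assumes a single-head, row-stochastic attention matrix, a uniform initial distribution, and the transposed attention matrix as the shift operator (a PageRank-style vote). The function names importance_scores and keep_top_tokens, the iteration count, and the tolerance are hypothetical, and the similarity-based stage mentioned in the abstract is not covered.

```python
import numpy as np


def importance_scores(attn, num_iters=50, tol=1e-6):
    """Estimate per-token importance from one attention matrix.

    Assumption: attn has shape (N, N) and each row sums to 1, where
    attn[i, j] is the attention that token i pays to token j.
    """
    n = attn.shape[0]
    # Start from a uniform distribution over tokens.
    scores = np.full(n, 1.0 / n)
    for _ in range(num_iters):
        # Graph shift: token j collects votes from every token i, weighted by
        # the attention i pays to j and by i's current score.
        updated = attn.T @ scores
        # Renormalize so the scores remain a probability distribution.
        updated = updated / updated.sum()
        converged = np.abs(updated - scores).max() < tol
        scores = updated
        if converged:
            break
    return scores


def keep_top_tokens(attn, keep):
    """Return the sorted indices of the `keep` highest-scoring tokens."""
    scores = importance_scores(attn)
    top = np.argsort(scores)[::-1][:keep]
    return np.sort(top)


if __name__ == "__main__":
    # Toy attention matrix: every token pays most of its attention to token 1.
    attn = np.full((4, 4), 0.1)
    attn[:, 1] = 0.7
    print(importance_scores(attn))
    print(keep_top_tokens(attn, keep=2))
```

In this toy input every token directs most of its attention to token 1, so token 1 ends up with the largest score and survives pruning. A real model would supply one attention matrix per head and per layer, and how those are combined is a design choice this sketch leaves open.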

updated: Sat May 27 2023 02:08:51 GMT+0000 (UTC)

published: Sat May 27 2023 02:08:51 GMT+0000 (UTC)

arXiv
