Multi-Tailed Vision Transformer for Efficient Inference

Yunke Wang; Bo Du; Wenyuan Wang; Chang Xu

Recently, Vision Transformer (ViT) has achieved promising performance in image recognition and is increasingly used as a powerful backbone in various vision tasks. To satisfy the sequential input of the Transformer, the tail of ViT first splits each image into a sequence of visual tokens with a fixed length. The following self-attention layers then construct the global relationships between tokens to produce useful representations for downstream tasks. Empirically, representing the image with more tokens leads to better performance, yet the quadratic computational complexity of the self-attention layer with respect to the number of tokens can seriously reduce the efficiency of ViT inference. To reduce computation, a few pruning methods progressively prune uninformative tokens in the Transformer encoder, while leaving the number of tokens before the Transformer untouched. In fact, feeding fewer tokens into the Transformer encoder directly reduces the subsequent computational cost. In this spirit, we propose a Multi-Tailed Vision Transformer (MT-ViT) in this paper. MT-ViT adopts multiple tails to produce visual sequences of different lengths for the following Transformer encoder. A tail predictor is introduced to decide which tail is the most efficient for a given image while still producing an accurate prediction. Both modules are optimized in an end-to-end fashion with the Gumbel-Softmax trick. Experiments on ImageNet-1K demonstrate that MT-ViT achieves a significant reduction in FLOPs with no degradation in accuracy and outperforms the other compared methods in both accuracy and FLOPs.
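
The two mechanisms named in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation. The 224-pixel input, the patch sizes 16 and 32, the embedding width 384, the example logits, and all function names are assumptions chosen for illustration; the abstract does not specify the actual tails. The first part counts tokens per tail and the cost of the two attention matrix products, which grows with the square of the token count. The second part draws a relaxed one-hot sample over per-tail logits in the style of Gumbel-Softmax, the kind of differentiable choice a tail predictor could be trained with.

```python
import math
import random


def num_tokens(image_size: int, patch_size: int) -> int:
    """Count non-overlapping square patches (visual tokens) in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("patch_size must divide image_size")
    side = image_size // patch_size
    return side * side


def attention_score_cost(n_tokens: int, dim: int) -> int:
    """Multiply-accumulates for Q @ K^T plus attention @ V: 2 * N^2 * d.

    Projections and MLP layers are ignored here; they scale linearly in N.
    """
    return 2 * n_tokens * n_tokens * dim


def gumbel_softmax(logits, temperature=1.0, rng=random):
    """Return one relaxed one-hot sample (a probability vector) over logits."""
    noisy = []
    for logit in logits:
        # Clamp U away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / temperature)
    peak = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def hard_choice(probs):
    """Index of the most probable tail, as would be used at inference time."""
    return max(range(len(probs)), key=probs.__getitem__)


if __name__ == "__main__":
    dim = 384  # assumed embedding width, not taken from the paper
    for patch in (16, 32):  # two hypothetical tails
        n = num_tokens(224, patch)
        print(f"patch={patch} tokens={n} attn_macs={attention_score_cost(n, dim)}")

    # Hypothetical tail-predictor logits for one image, one per tail.
    probs = gumbel_softmax([0.2, 1.5], temperature=0.5, rng=random.Random(0))
    print("relaxed sample:", probs, "chosen tail:", hard_choice(probs))
```

Under these assumptions the 16-pixel tail yields 196 tokens and the 32-pixel tail yields 49, so the attention products are 16 times cheaper for the coarser tail. Lower temperatures push the relaxed sample closer to a hard one-hot choice, while higher temperatures keep it smoother.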

updated: Mon Mar 18 2024 14:32:54 GMT+0000 (UTC)

published: Thu Mar 03 2022 09:30:55 GMT+0000 (UTC)

arXiv
