Make A Long Image Short: Adaptive Token Length for Vision Transformers

Yichen Zhu; Yuqin Zhu; Jie Du; Yi Wang; Zhicai Ou; Feifei Feng; Jian Tang

The vision transformer (ViT) splits each image into a fixed-length sequence of tokens and processes the tokens in the same way as words in natural language processing. More tokens normally lead to better performance, but at considerably higher computational cost. Motivated by the proverb "A picture is worth a thousand words," we aim to accelerate the ViT model by making a long image short. To this end, we propose a novel approach that assigns token length adaptively during inference. Specifically, we first train a ViT model, called Resizable-ViT (ReViT), that can process any given input with diverse token lengths. Then, we retrieve the "token-length label" from ReViT and use it to train a lightweight Token-Length Assigner (TLA). The token-length label of an image is the smallest number of tokens into which the image can be split such that ReViT still makes the correct prediction, and the TLA learns to allocate the optimal token length based on these labels. The TLA enables ReViT to process each image with the minimum sufficient number of tokens during inference. Inference is therefore faster because the ViT model processes fewer tokens. Our approach is general, is compatible with modern vision transformer architectures, and can significantly reduce computational cost. We verified the effectiveness of our method on multiple representative ViT models (DeiT, LV-ViT, and TimeSformer) across two tasks (image classification and action recognition).
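The labeling and inference steps described in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the candidate token lengths (16, 49, 196), and the toy stand-in for ReViT are all hypothetical, and the abstract does not say how images that are misclassified at every token length are labeled, so the sketch simply returns `None` for them.

```python
from typing import Callable, Optional, Sequence, Tuple

# A "model" here is any callable (image, token_length) -> predicted class.
Predictor = Callable[[dict, int], int]


def token_length_label(
    predict: Predictor,
    image: dict,
    target: int,
    candidate_lengths: Sequence[int],
) -> Optional[int]:
    """Smallest candidate token length at which `predict` returns `target`.

    Returns None if no candidate gives the correct class (handling of this
    case is an assumption; the abstract does not specify it).
    """
    for length in sorted(candidate_lengths):
        if predict(image, length) == target:
            return length
    return None


def adaptive_inference(
    assign_length: Callable[[dict], int],
    predict: Predictor,
    image: dict,
) -> Tuple[int, int]:
    """Pick a token length with the assigner, then run the model once."""
    length = assign_length(image)
    return predict(image, length), length


# Toy stand-in for ReViT (hypothetical): correct once it gets enough tokens.
def toy_predict(image: dict, length: int) -> int:
    return image["label"] if length >= image["min_tokens"] else -1


CANDIDATES = (16, 49, 196)  # hypothetical token lengths
easy = {"label": 3, "min_tokens": 40}
hard = {"label": 7, "min_tokens": 150}

easy_label = token_length_label(toy_predict, easy, 3, CANDIDATES)  # 49
hard_label = token_length_label(toy_predict, hard, 7, CANDIDATES)  # 196


# Toy stand-in for a trained TLA (hypothetical): it just looks up the labels.
def oracle_assigner(image: dict) -> int:
    return easy_label if image is easy else hard_label


prediction, used_length = adaptive_inference(oracle_assigner, toy_predict, easy)
```

In this toy setup the easy image is handled with 49 tokens instead of 196, which is the source of the speed-up the abstract describes; a real TLA would be a lightweight learned model rather than a lookup.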

updated: Fri Dec 03 2021 02:48:51 GMT+0000 (UTC)

published: Fri Dec 03 2021 02:48:51 GMT+0000 (UTC)

arXiv
