PuMer: Pruning and Merging Tokens for Efficient Vision Language Models

Qingqing Cao; Bhargavi Paranjape; Hannaneh Hajishirzi

Large-scale vision language (VL) models use Transformers to perform cross-modal interactions between the input text and image. These cross-modal interactions are computationally expensive and memory-intensive because the cost of processing the input image and text tokens grows quadratically with their number. We present PuMer, a token reduction framework that uses text-informed Pruning and modality-aware Merging strategies to progressively reduce the input image and text tokens, improving model inference speed and reducing memory footprint. PuMer adds lightweight token reducer modules at several cross-modal layers of the VL model; these modules learn to keep salient image tokens related to the input text and to merge similar textual and visual tokens. Training PuMer is mostly the same as finetuning the original VL model, but faster. Our evaluation of two vision language models on four downstream VL tasks shows that PuMer increases inference throughput by up to 2x and reduces memory footprint by over 50% while incurring less than a 1% accuracy drop.
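To make the two reduction steps concrete, the following is a minimal sketch of text-informed pruning followed by merging within each modality. It is an illustration under stated assumptions, not the authors' implementation: the abstract describes learned token reducer modules, whereas this sketch substitutes plain cosine similarity for both the saliency score and the merge criterion. All function names (`prune_image_tokens`, `merge_similar_tokens`, `reduce_tokens`) and the parameters `keep_ratio` and `num_merges` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def prune_image_tokens(image_tokens, text_tokens, keep_ratio):
    """Text-informed pruning (assumed scoring): keep the image tokens whose
    best cosine match against any text token is highest."""
    scores = [max(cosine(img, txt) for txt in text_tokens) for img in image_tokens]
    k = max(1, math.ceil(len(image_tokens) * keep_ratio))
    ranked = sorted(range(len(image_tokens)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:k])  # restore original token order
    return [image_tokens[i] for i in keep]


def merge_similar_tokens(tokens, num_merges):
    """Merging within a single modality (assumed rule): repeatedly average
    the most similar pair of tokens, num_merges times."""
    tokens = [list(t) for t in tokens]
    for _ in range(num_merges):
        if len(tokens) < 2:
            break
        best = None
        for i in range(len(tokens)):
            for j in range(i + 1, len(tokens)):
                sim = cosine(tokens[i], tokens[j])
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        _, i, j = best
        tokens[i] = [(x + y) / 2 for x, y in zip(tokens[i], tokens[j])]
        del tokens[j]
    return tokens


def reduce_tokens(image_tokens, text_tokens, keep_ratio=0.5, num_merges=1):
    """Prune image tokens using the text, then merge image and text tokens
    separately so the two modalities are never mixed."""
    kept = prune_image_tokens(image_tokens, text_tokens, keep_ratio)
    return merge_similar_tokens(kept, num_merges), merge_similar_tokens(text_tokens, num_merges)


# Toy 2-dimensional tokens: two image tokens align with the text, two do not.
image = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
text = [[1.0, 0.0], [1.0, 0.1]]
image_out, text_out = reduce_tokens(image, text)
print(len(image) + len(text), "->", len(image_out) + len(text_out))
```

Because the cost of the cross-modal layers grows quadratically with the token count, shrinking the sequence in this toy case from 6 tokens to 2 reduces the number of pairwise token interactions from 36 to 4. In a real model the reduction would be applied progressively at several layers rather than once.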

updated: Sat May 27 2023 17:16:27 GMT+0000 (UTC)

published: Sat May 27 2023 17:16:27 GMT+0000 (UTC)

arXiv

