CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers

Dachuan Shi; Chaofan Tao; Anyi Rao; Zhendong Yang; Chun Yuan; Jiaqi Wang

Vision-language models have progressed far beyond what was expected of them. However, their computational cost and latency have grown just as dramatically, making model acceleration critical for researchers with limited resources and for consumers with low-end devices. Although acceleration has been extensively studied for unimodal models, it remains relatively under-explored for multimodal models, especially vision-language Transformers. Accordingly, this paper proposes Cross-Guided Ensemble of Tokens (CrossGET) as a universal vision-language Transformer acceleration framework. CrossGET adaptively reduces the number of tokens during inference, on the fly, using cross-modal guidance, which yields significant model acceleration while maintaining high performance. Specifically, the proposed CrossGET has two key designs: 1) Cross-Guided Matching and Ensemble. CrossGET incorporates cross-modal guided token matching and ensemble to merge tokens effectively, introducing only cross-modal tokens with negligible extra parameters. 2) Complete-Graph Soft Matching. In contrast to the previous bipartite soft matching approach, CrossGET introduces an efficient and effective complete-graph soft matching policy to achieve more reliable token-matching results. Extensive experiments on various vision-language tasks, datasets, and model architectures demonstrate the effectiveness and versatility of the proposed CrossGET framework. The code will be available at https://github.com/sdc17/CrossGET.
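
The abstract refers to an earlier bipartite soft matching approach but gives no algorithmic detail. For orientation only, the following is a generic sketch of similarity-based token merging with a bipartite split. Everything in it is an assumption made for illustration and is not taken from the paper or its repository: the names `cosine` and `bipartite_merge`, the alternating split into two sets, the use of cosine similarity, and merging by plain averaging. It does not implement CrossGET's cross-modal guidance or its complete-graph soft matching.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def bipartite_merge(tokens, r):
    """Hypothetical helper: reduce len(tokens) by up to r via bipartite matching.

    Tokens at even positions form set A, tokens at odd positions form set B.
    Each A token is matched to its most similar B token; the r best-matched
    A tokens are averaged into their partners and removed from the sequence.
    """
    a_idx = list(range(0, len(tokens), 2))
    b_idx = list(range(1, len(tokens), 2))
    r = max(0, min(r, len(a_idx)))
    if not b_idx or r == 0:
        return [list(t) for t in tokens]

    # For every A token, record (similarity, source index, destination index).
    matches = []
    for i in a_idx:
        j = max(b_idx, key=lambda k: cosine(tokens[i], tokens[k]))
        matches.append((cosine(tokens[i], tokens[j]), i, j))
    matches.sort(key=lambda m: m[0], reverse=True)

    # Accumulate running sums and counts for each destination token in B.
    sums = {j: list(tokens[j]) for j in b_idx}
    counts = {j: 1 for j in b_idx}
    removed = set()
    for _, i, j in matches[:r]:
        sums[j] = [s + x for s, x in zip(sums[j], tokens[i])]
        counts[j] += 1
        removed.add(i)

    # Rebuild the sequence in original order, skipping merged-away tokens.
    out = []
    for idx, tok in enumerate(tokens):
        if idx in removed:
            continue
        if idx in sums:
            out.append([s / counts[idx] for s in sums[idx]])
        else:
            out.append(list(tok))
    return out
```

Calling `bipartite_merge(tokens, r)` on a list of feature vectors returns a shorter list with at most `r` fewer entries. In this sketch a token can only be matched across the two sets, never within one; the abstract describes complete-graph soft matching as CrossGET's alternative to the bipartite approach.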

updated: Sat May 27 2023 12:07:21 GMT+0000 (UTC)

published: Sat May 27 2023 12:07:21 GMT+0000 (UTC)

arXiv
