You Need Multiple Exiting: Dynamic Early Exiting for Accelerating Unified Vision Language Model

Shengkun Tang; Yaqing Wang; Zhenglun Kong; Tianchi Zhang; Yao Li; Caiwen Ding; Yanzhi Wang; Yi Liang; Dongkuan Xu

Large-scale Transformer models bring significant improvements to various downstream vision language tasks with a unified architecture. These performance gains come with increasing model size, resulting in slow inference and a higher cost of serving. While certain predictions benefit from the full complexity of the large-scale model, not all inputs need the same amount of computation, so running every layer for every input can waste computational resources. To address this, early exiting has been proposed to allocate computation adaptively according to input complexity and thereby improve inference efficiency. Existing early exiting strategies usually use output confidence computed from intermediate layers as a proxy for input complexity when deciding whether to skip the following layers. However, such strategies cannot be applied to the encoder in the widely used unified architecture with both an encoder and a decoder, because output confidence is difficult to estimate in the encoder. Ignoring early exiting in the encoder is suboptimal for saving computation. To handle this challenge, we propose a novel early exiting strategy for unified vision language models, named MuE, which dynamically skips layers in the encoder and decoder simultaneously based on layer-wise input similarities, exiting multiple times. By decomposing the image and text modalities in the encoder, MuE is flexible and can skip different layers for each modality, improving inference efficiency while minimizing the performance drop. Experiments on the SNLI-VE and MS COCO datasets show that the proposed approach MuE can reduce expected inference time by up to 50% and 40% while maintaining 99% and 96% performance, respectively.
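The idea of exiting on layer-wise similarity, rather than on output confidence, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the similarity measure or the exit rule, so the use of cosine similarity, the fixed `threshold`, the helper names `cosine_similarity` and `run_with_early_exit`, and the toy layer are all assumptions made for illustration.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Layer = Callable[[Vector], Vector]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def run_with_early_exit(
    layers: Sequence[Layer], hidden: Vector, threshold: float
) -> Tuple[Vector, int]:
    """Apply layers in order and stop once a layer barely changes its input.

    Assumed exit rule (hypothetical): exit when the cosine similarity between
    a layer's input and output reaches `threshold`.
    Returns the final hidden state and the number of layers executed.
    """
    for executed, layer in enumerate(layers, start=1):
        new_hidden = layer(hidden)
        similarity = cosine_similarity(hidden, new_hidden)
        hidden = new_hidden
        if similarity >= threshold:
            return hidden, executed  # remaining layers are skipped
    return hidden, len(layers)


def toy_layer(h: Vector) -> Vector:
    """Toy stand-in for a Transformer layer: moves each value halfway toward 1.0."""
    return [x + 0.5 * (1.0 - x) for x in h]


layers = [toy_layer] * 6
hidden, used = run_with_early_exit(layers, [1.0, 0.0], threshold=0.99)
print(f"executed {used} of {len(layers)} layers, final state {hidden}")
```

Because the rule needs only hidden states and no classifier output, the same function could in principle be called separately on an image encoder stack, a text encoder stack, and a decoder stack, which is one way to read the "multiple exiting" described in the abstract. The right threshold for each stack would have to be tuned empirically.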

updated: Mon Nov 21 2022 02:32:25 GMT+0000 (UTC)

published: Mon Nov 21 2022 02:32:25 GMT+0000 (UTC)

arXiv
