Compound Tokens: Channel Fusion for Vision-Language Representation Learning

Maxwell Mbabilla Aladago; AJ Piergiovanni

We present an effective method for fusing vision and language representations for several question answering tasks, including visual question answering and visual entailment. In contrast to prior works that concatenate unimodal representations or use only cross-attention, we compose multimodal representations via channel fusion. By fusing on the channels, the model is able to align the tokens more effectively than standard methods. These multimodal representations, which we call compound tokens, are generated with cross-attention transformer layers. First, vision tokens are used as queries to retrieve compatible text tokens through cross-attention. We then concatenate the vision tokens and the queried text tokens along the channel dimension. A second group of compound tokens is generated using an analogous process in which the text tokens serve as queries to the cross-attention layer. We concatenate all the compound tokens for further processing with a multimodal encoder. We demonstrate the effectiveness of compound tokens using an encoder-decoder vision-language model trained end-to-end in the open-vocabulary setting. Compound Tokens achieve highly competitive performance across a range of question answering tasks, including GQA, VQA2.0, and SNLI-VE.
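The fusion steps described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`cross_attention`, `compound_tokens`), the single attention head, the absence of learned projections and normalization, and the assumption that vision and text tokens share the same channel width `d` are all simplifications made here for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, keys_values):
    """Single-head scaled dot-product cross-attention.

    Assumption: no learned query/key/value projections, so this only
    shows the data flow, not a trainable layer.
    queries: (n_q, d), keys_values: (n_kv, d) -> output: (n_q, d)
    """
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)  # (n_q, n_kv)
    weights = softmax(scores, axis=-1)             # each row sums to 1
    return weights @ keys_values                   # (n_q, d)


def compound_tokens(vision, text):
    """Build compound tokens by channel fusion (hypothetical helper).

    vision: (n_v, d), text: (n_t, d) -> (n_v + n_t, 2 * d)
    """
    # Group 1: vision tokens query the text tokens, then the two are
    # joined along the channel (last) dimension.
    text_for_vision = cross_attention(vision, text)
    vision_compound = np.concatenate([vision, text_for_vision], axis=-1)

    # Group 2: the analogous process with text tokens as queries.
    vision_for_text = cross_attention(text, vision)
    text_compound = np.concatenate([text, vision_for_text], axis=-1)

    # All compound tokens are concatenated along the token dimension
    # before being passed to a multimodal encoder (not shown).
    return np.concatenate([vision_compound, text_compound], axis=0)


rng = np.random.default_rng(0)
vision = rng.normal(size=(4, 8))  # 4 vision tokens, 8 channels
text = rng.normal(size=(3, 8))    # 3 text tokens, 8 channels
tokens = compound_tokens(vision, text)
print(tokens.shape)  # (7, 16)
```

Note that in this sketch the channel width doubles from `d` to `2 * d` while the number of tokens stays at `n_v + n_t`. A real model would likely add projections so that the fused width matches what the multimodal encoder expects; that detail is not stated in the abstract and is left out here.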

updated: Fri Dec 02 2022 21:09:52 GMT+0000 (UTC)

published: Fri Dec 02 2022 21:09:52 GMT+0000 (UTC)

arXiv
