Bridge-Tower: Building Bridges Between Encoders in Vision-Language Representation Learning

Xiao Xu; Chenfei Wu; Shachar Rosenman; Vasudev Lal; Nan Duan

Vision-Language (VL) models with the Two-Tower architecture have dominated vision-language representation learning in recent years. Current VL models either use lightweight uni-modal encoders and learn to extract, align, and fuse both modalities simultaneously in a cross-modal encoder, or feed the last-layer uni-modal features directly into the top cross-modal encoder, ignoring the semantic information at the different levels of the deep uni-modal encoders. Both approaches may restrict vision-language representation learning and limit model performance. In this paper, we introduce multiple bridge layers that build a connection between the top layers of the uni-modal encoders and each layer of the cross-modal encoder. This enables comprehensive bottom-up interactions between visual and textual representations at different semantic levels, resulting in more effective cross-modal alignment and fusion. Our proposed Bridge-Tower, pre-trained with only 4M images, achieves state-of-the-art performance on various downstream vision-language tasks. On the VQAv2 test-std set, Bridge-Tower achieves an accuracy of 78.73%, outperforming the previous state-of-the-art METER model by 1.09% with the same pre-training data and almost no additional parameters or computational cost. Notably, when the model is scaled further, Bridge-Tower achieves an accuracy of 81.15%, surpassing models that are pre-trained on orders-of-magnitude larger datasets. Code is available at https://github.com/microsoft/BridgeTower.
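
The wiring described in the abstract, where each cross-modal layer receives a bridge from one of the top uni-modal layers, can be illustrated with a small sketch. This is not the authors' implementation. The abstract does not say how a bridge merges its two inputs, so the sketch assumes an element-wise sum followed by layer normalization without learned parameters. The function names `bridge` and `bridge_schedule` and the layer counts (12 uni-modal, 6 cross-modal) are hypothetical choices made only for illustration.

```python
import math
from typing import List

Vector = List[float]


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """Normalize a vector to zero mean and (near) unit variance, no learned scale or shift."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def bridge(cross_modal_state: Vector, uni_modal_state: Vector) -> Vector:
    """Hypothetical bridge layer (an assumption): add the two states, then layer-normalize."""
    summed = [c + u for c, u in zip(cross_modal_state, uni_modal_state)]
    return layer_norm(summed)


def bridge_schedule(num_uni_layers: int, num_cross_layers: int) -> List[int]:
    """For each cross-modal layer, return the 1-based uni-modal layer bridged into it.

    The top `num_cross_layers` uni-modal layers are paired bottom-up with the
    cross-modal layers, so lower layers feed lower layers.
    """
    if num_cross_layers > num_uni_layers:
        raise ValueError("need at least as many uni-modal layers as cross-modal layers")
    first = num_uni_layers - num_cross_layers + 1
    return list(range(first, num_uni_layers + 1))


# Assumed sizes for illustration only: 12 uni-modal layers, 6 cross-modal layers.
schedule = bridge_schedule(12, 6)

# Toy 4-dimensional states standing in for one token's representation.
fused = bridge([1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 0.5, 0.5])
```

Under these assumed sizes, `schedule` pairs uni-modal layers 7 through 12 with cross-modal layers 1 through 6, which contrasts with feeding only the last uni-modal layer into the cross-modal encoder.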

updated: Fri Jun 17 2022 09:42:35 GMT+0000 (UTC)

published: Fri Jun 17 2022 09:42:35 GMT+0000 (UTC)

arXiv
