Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone

Zi-Yi Dou; Aishwarya Kamath; Zhe Gan; Pengchuan Zhang; Jianfeng Wang; Linjie Li; Zicheng Liu; Ce Liu; Yann LeCun; Nanyun Peng; Jianfeng Gao; Lijuan Wang

Vision-language (VL) pre-training has recently received considerable attention. However, most existing end-to-end pre-training approaches either aim only at VL tasks that test high-level understanding of images, such as image-text retrieval, visual question answering (VQA), and image captioning, or target only region-level understanding for tasks such as phrase grounding and object detection. We present FIBER (Fusion-In-the-Backbone-based transformER), a new VL model architecture that can seamlessly handle both of these types of tasks. Instead of having dedicated transformer layers for fusion after the uni-modal backbones, FIBER pushes multimodal fusion deep into the model by inserting cross-attention into the image and text backbones, bringing gains in terms of memory and performance. In addition, unlike previous work that is pre-trained either only on image-text data or only on fine-grained data with box-level annotations, we present a two-stage pre-training strategy that uses both of these kinds of data efficiently: (i) coarse-grained pre-training based on image-text data, followed by (ii) fine-grained pre-training based on image-text-box data. We conduct comprehensive experiments on a wide range of VL tasks, from VQA, image captioning, and retrieval to phrase grounding, referring expression comprehension, and object detection. Using deep multimodal fusion coupled with the two-stage pre-training, FIBER provides consistent performance improvements over strong baselines across all tasks, often outperforming methods that use orders of magnitude more data. Code is available at https://github.com/microsoft/FIBER.
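The idea of fusing inside the backbones, rather than in extra layers stacked on top of them, can be illustrated with a small sketch. This is not the FIBER implementation (see the repository linked above for that); it is a toy, pure-Python example written under several assumptions. The function names (`attention`, `fused_block`), the scalar `gate` on the cross-attention branch, and the token shapes are all hypothetical. Learned projections, multiple heads, normalization, and feed-forward layers are omitted, and both modalities are assumed to share one feature dimension so that no projection is needed.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention without learned projections."""
    scale = math.sqrt(len(queries[0]))
    keys_t = [list(col) for col in zip(*keys)]
    scores = matmul(queries, keys_t)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, values)


def add(a, b, gate=1.0):
    """Residual connection: a + gate * b, element-wise."""
    return [[x + gate * y for x, y in zip(ra, rb)] for ra, rb in zip(a, rb_rows(b))]


def rb_rows(b):
    """Return the rows of b unchanged (kept separate for readability)."""
    return b


def fused_block(own, other, gate):
    """One backbone block: self-attention, then gated cross-attention.

    `own` holds the tokens of the backbone this block belongs to and `other`
    holds the tokens of the other modality. The gate is an assumption made
    for this sketch: gate=0.0 reduces the block to a uni-modal block.
    """
    own = add(own, attention(own, own, own))
    own = add(own, attention(own, other, other), gate)
    return own


def random_tokens(count, dim, rng):
    """Create `count` random token vectors of size `dim`."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
image_tokens = random_tokens(6, 4, rng)  # e.g. 6 image patches
text_tokens = random_tokens(3, 4, rng)   # e.g. 3 word pieces

# Each backbone attends to the other modality inside its own block.
fused_image = fused_block(image_tokens, text_tokens, gate=0.5)
fused_text = fused_block(text_tokens, image_tokens, gate=0.5)

# With the gate closed, the image block ignores the text entirely.
unimodal_image = fused_block(image_tokens, text_tokens, gate=0.0)
```

Each modality keeps its own sequence length after fusion (six image tokens, three text tokens), which is one reason this style of fusion fits region-level tasks as well as image-level ones: the image features stay spatially organized while still being conditioned on the text.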

updated: Wed Jun 15 2022 16:41:29 GMT+0000 (UTC)

published: Wed Jun 15 2022 16:41:29 GMT+0000 (UTC)

arXiv
