CoBIT: A Contrastive Bi-directional Image-Text Generation Model

Haoxuan You; Mandy Guo; Zhecan Wang; Kai-Wei Chang; Jason Baldridge; Jiahui Yu



The field of vision and language has witnessed a proliferation of pre-trained foundation models. Most existing methods are independently pre-trained with a contrastive objective like CLIP, an image-to-text generative objective like PaLI, or a text-to-image generative objective like Parti. However, all three objectives can be pre-trained on the same data, image-text pairs, and intuitively they complement each other: contrastive learning provides global alignment capacity, while generation grants fine-grained understanding. In this work, we present a Contrastive Bi-directional Image-Text generation model (CoBIT), which attempts to unify the three pre-training objectives in one framework. Specifically, CoBIT employs a novel unicoder-decoder structure, consisting of an image unicoder, a text unicoder and a cross-modal decoder. The image/text unicoders can switch between encoding and decoding in different tasks, enabling flexibility and shared knowledge that benefits both image-to-text and text-to-image generation. CoBIT achieves superior performance in image understanding, image-text understanding (Retrieval, Captioning, VQA, SNLI-VE) and text-based content creation, particularly in zero-shot scenarios. For instance, it achieves 82.7% accuracy in zero-shot ImageNet classification, a 9.37 FID score in zero-shot text-to-image generation and 44.8 CIDEr in zero-shot captioning.
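
To make the idea of training on three objectives at once more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It shows a CLIP-style symmetric contrastive loss over a batch of paired embeddings, plus a weighted sum of three loss terms. The abstract does not say how the objectives are combined, so the weighted sum, the weights, the temperature value, and the names `contrastive_loss` and `combined_loss` are all assumptions made for illustration. The two generative losses are passed in as plain numbers.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _log_softmax(row):
    """Numerically stable log-softmax of a list of floats."""
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum for x in row]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss; pair k of images matches pair k of texts.

    The temperature of 0.07 is an illustrative assumption, not a value from the abstract.
    """
    images = [_normalize(v) for v in image_embs]
    texts = [_normalize(v) for v in text_embs]
    n = len(images)
    # logits[i][j] = similarity between image i and text j, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    # Image-to-text direction: each row should put its mass on the diagonal.
    image_to_text = -sum(_log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image direction: same computation over the columns.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    text_to_image = -sum(_log_softmax(columns[k])[k] for k in range(n)) / n
    return 0.5 * (image_to_text + text_to_image)


def combined_loss(l_contrastive, l_image_to_text, l_text_to_image,
                  w_contrastive=1.0, w_image_to_text=1.0, w_text_to_image=1.0):
    """Hypothetical weighted sum of the three pre-training losses (weights are assumptions)."""
    return (w_contrastive * l_contrastive
            + w_image_to_text * l_image_to_text
            + w_text_to_image * l_text_to_image)
```

With correctly paired embeddings the contrastive term is close to zero, and it grows when the pairing is shuffled, which is the global alignment signal the abstract attributes to the contrastive objective.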

updated: Thu Mar 23 2023 17:24:31 GMT+0000 (UTC)

published: Thu Mar 23 2023 17:24:31 GMT+0000 (UTC)

arXiv

