How Much Can CLIP Benefit Vision-and-Language Tasks?

Sheng Shen; Liunian Harold Li; Hao Tan; Mohit Bansal; Anna Rohrbach; Kai-Wei Chang; Zhewei Yao; Kurt Keutzer

Most existing Vision-and-Language (V&L) models rely on visual encoders pre-trained on a relatively small set of manually annotated data (compared to web-crawled data) to perceive the visual world. However, it has been observed that large-scale pre-training usually results in better generalization performance. For example, CLIP (Contrastive Language-Image Pre-training), trained on a massive number of image-caption pairs, has shown strong zero-shot capability on various vision tasks. To further study the advantage brought by CLIP, we propose to use CLIP as the visual encoder in various V&L models in two typical scenarios: 1) plugging CLIP into task-specific fine-tuning; 2) combining CLIP with V&L pre-training and transferring to downstream tasks. We show that CLIP significantly outperforms widely used visual encoders trained with in-domain annotated data, such as BottomUp-TopDown. We achieve competitive or better results on diverse V&L tasks, while establishing new state-of-the-art results on Visual Question Answering, Visual Entailment, and V&L Navigation tasks. We release our code at https://github.com/clip-vil/CLIP-ViL.
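
As a rough illustration of the zero-shot capability mentioned in the abstract, the sketch below shows how a CLIP-style model can classify an image by comparing its embedding with the embeddings of candidate captions: cosine similarities are scaled and passed through a softmax. This is not the code from the CLIP-ViL repository. The three-dimensional embeddings, the captions, the helper names (`l2_normalize`, `zero_shot_probs`), and the `logit_scale` value of 100 are all assumptions chosen for illustration; a real model would produce the embeddings with its image and text encoders.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """Return a probability for each caption given one image embedding.

    logit_scale is an assumed temperature factor (hypothetical value);
    larger values make the softmax sharper.
    """
    img = l2_normalize(image_emb)
    logits = []
    for text_emb in text_embs:
        txt = l2_normalize(text_emb)
        cosine = sum(a * b for a, b in zip(img, txt))
        logits.append(logit_scale * cosine)
    # Numerically stable softmax over the scaled similarities.
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy, hand-written embeddings (hypothetical; not produced by any real encoder).
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
text_embs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
image_emb = [0.8, 0.3, 0.1]

probs = zero_shot_probs(image_emb, text_embs)
best = labels[max(range(len(labels)), key=probs.__getitem__)]
print(best)
```

No task-specific training is involved in this procedure: the candidate captions alone define the label set, which is why the same pre-trained encoders can be applied to new vision tasks directly.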

updated: Tue Jul 13 2021 20:48:12 GMT+0000 (UTC)

published: Tue Jul 13 2021 20:48:12 GMT+0000 (UTC)

arXiv
