VinVL: Revisiting Visual Representations in Vision-Language Models

Pengchuan Zhang; Xiujun Li; Xiaowei Hu; Jianwei Yang; Lei Zhang; Lijuan Wang; Yejin Choi; Jianfeng Gao

This paper presents a detailed study of improving visual representations for vision-language (VL) tasks and develops an improved object detection model to provide object-centric representations of images. Compared to the most widely used bottom-up and top-down model (Anderson et al., 2018), the new model is bigger, better designed for VL tasks, and pre-trained on much larger training corpora that combine multiple public annotated object detection datasets. Therefore, it can generate representations of a richer collection of visual objects and concepts. While previous VL research focuses mainly on improving the vision-language fusion model and leaves the object detection model untouched, we show that visual features matter significantly in VL models. In our experiments, we feed the visual features generated by the new object detection model into a Transformer-based VL fusion model, OSCAR (Li et al., 2020), and use an improved approach, OSCAR+, to pre-train the VL model and fine-tune it on a wide range of downstream VL tasks. Our results show that the new visual features significantly improve performance across all VL tasks, creating new state-of-the-art results on seven public benchmarks. We will release the new object detection model to the public.

updated: Wed Mar 10 2021 01:27:16 GMT+0000 (UTC)

published: Sat Jan 02 2021 23:35:27 GMT+0000 (UTC)

arXiv

