Scaling Up Vision-Language Pre-training for Image Captioning

Xiaowei Hu; Zhe Gan; Jianfeng Wang; Zhengyuan Yang; Zicheng Liu; Yumao Lu; Lijuan Wang

In recent years, we have witnessed a significant performance boost in the image captioning task based on vision-language pre-training (VLP). Scale is believed to be an important factor for this advance. However, most existing work focuses only on pre-training transformers of moderate size (e.g., 12 or 24 layers) on roughly 4 million images. In this paper, we present LEMON, a LargE-scale iMage captiONer, and provide the first empirical study on the scaling behavior of VLP for image captioning. We use the state-of-the-art VinVL model as our reference model, which consists of an image feature extractor and a transformer model, and scale the transformer both up and down, with model sizes ranging from 13 to 675 million parameters. In terms of data, we conduct experiments with up to 200 million image-text pairs, which are automatically collected from the web based on the alt attribute of the image (dubbed ALT200M). Extensive analysis helps to characterize the performance trend as the model size and the pre-training data size increase. We also compare different training recipes, especially for training on large-scale noisy data. As a result, LEMON achieves new state-of-the-art results on several major image captioning benchmarks, including COCO Caption, nocaps, and Conceptual Captions. We also show that LEMON can generate captions with long-tail visual concepts when used in a zero-shot manner.

updated: Wed Nov 24 2021 02:30:22 GMT+0000 (UTC)

published: Wed Nov 24 2021 02:30:22 GMT+0000 (UTC)

arXiv
