Universal Captioner: Long-Tail Vision-and-Language Model Training through Content-Style Separation

Marcella Cornia; Lorenzo Baraldi; Giuseppe Fiameni; Rita Cucchiara

While captioning models have obtained compelling results in describing natural images, they still do not cover the entire long-tail distribution of real-world concepts. In this paper, we address the task of generating human-like descriptions with in-the-wild concepts by training on web-scale, automatically collected datasets. To this end, we propose a model that can exploit noisy image-caption pairs while maintaining the descriptive style of traditional human-annotated datasets such as COCO. Our model separates content from style through the use of keywords and stylistic tokens, employs a single prompt language modeling objective, and is simpler than other recent proposals. Experimentally, our model consistently outperforms existing methods in terms of caption quality and the ability to describe long-tail concepts, including in zero-shot settings. According to the CIDEr metric, we obtain a new state of the art on both COCO and nocaps when using external data.
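To make the idea of separating content from style more concrete, the following is a minimal sketch of how a prompt might be assembled from a stylistic token and a list of keywords before the caption is generated. It is not the authors' implementation: the token names (`<style:human>`, `<style:web>`, `<sep>`, `<eos>`), the function names, and the choice to exclude prompt positions from the loss are all assumptions made for illustration, since the abstract does not specify them.

```python
from typing import List, Tuple

# Hypothetical special tokens; these names are illustrative assumptions,
# not taken from the paper.
STYLE_TOKENS = {"human": "<style:human>", "web": "<style:web>"}
SEP = "<sep>"
EOS = "<eos>"


def build_prompt(style: str, keywords: List[str]) -> List[str]:
    """Return the prompt: one style token, the content keywords, a separator."""
    if style not in STYLE_TOKENS:
        raise ValueError(f"unknown style: {style!r}")
    return [STYLE_TOKENS[style], *keywords, SEP]


def build_training_sequence(
    style: str, keywords: List[str], caption_tokens: List[str]
) -> Tuple[List[str], List[int]]:
    """Concatenate prompt and caption, and mark which positions are targets.

    Assumption: only caption positions (plus the end-of-sequence token)
    contribute to the language modeling loss; prompt positions are masked.
    """
    prompt = build_prompt(style, keywords)
    tokens = prompt + caption_tokens + [EOS]
    loss_mask = [0] * len(prompt) + [1] * (len(caption_tokens) + 1)
    return tokens, loss_mask


if __name__ == "__main__":
    tokens, mask = build_training_sequence(
        "web",
        ["corgi", "skateboard"],
        ["a", "corgi", "rides", "a", "skateboard"],
    )
    for token, flag in zip(tokens, mask):
        print(f"{flag} {token}")
```

Under these assumptions, a noisy web caption would be paired with the `web` style token during training, while at inference time the `human` style token could be supplied together with keywords for long-tail concepts, so that the content comes from the keywords and the phrasing follows the human-annotated style.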

updated: Wed Nov 24 2021 19:00:05 GMT+0000 (UTC)

published: Wed Nov 24 2021 19:00:05 GMT+0000 (UTC)

arXiv
