RedCaps: web-curated image-text data created by the people, for the people

Karan Desai; Gaurav Kaul; Zubin Aysola; Justin Johnson

Large datasets of paired images and text have become increasingly popular for learning generic representations for vision and vision-and-language tasks. Such datasets have been built by querying search engines or collecting HTML alt-text -- since web data is noisy, they require complex filtering pipelines to maintain quality. We explore alternate data sources to collect high quality data with minimal filtering. We introduce RedCaps -- a large-scale dataset of 12M image-text pairs collected from Reddit. Images and captions from Reddit depict and describe a wide variety of objects and scenes. We collect data from a manually curated set of subreddits, which give coarse image labels and allow us to steer the dataset composition without labeling individual instances. We show that captioning models trained on RedCaps produce rich and varied captions preferred by humans, and learn visual representations that transfer to many downstream tasks.
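As an illustration of how a curated subreddit list can act as both a filter and a source of coarse labels, here is a minimal sketch. It is not the authors' pipeline: the `Post` record, the `select_pairs` and `composition` helpers, and the subreddit names are all hypothetical and chosen only for this example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Post:
    """Hypothetical record for one image post (not the RedCaps schema)."""
    subreddit: str
    image_url: str
    caption: str


def select_pairs(posts, allowed):
    """Keep posts whose subreddit is in the curated allowlist.

    Returns (image_url, caption, coarse_label) triples, where the
    coarse label is simply the lowercased subreddit name.
    """
    allowed_lower = {name.lower() for name in allowed}
    pairs = []
    for post in posts:
        label = post.subreddit.lower()
        caption = post.caption.strip()
        # Drop posts outside the curated set and posts with empty captions.
        if label in allowed_lower and caption:
            pairs.append((post.image_url, caption, label))
    return pairs


def composition(pairs):
    """Count image-text pairs per coarse label to inspect dataset balance."""
    return Counter(label for _, _, label in pairs)


# Illustrative input; the subreddit names are placeholders.
posts = [
    Post("birds", "https://example.com/a.jpg", "a robin on a fence"),
    Post("memes", "https://example.com/b.jpg", "lol"),
    Post("Food", "https://example.com/c.jpg", "  homemade ramen "),
]
pairs = select_pairs(posts, {"birds", "food"})
counts = composition(pairs)
```

Changing the allowlist changes which pairs are kept and how many fall under each label, which is one way to read "steering the dataset composition without labeling individual instances".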

updated: Mon Nov 22 2021 18:59:34 GMT+0000 (UTC)

published: Mon Nov 22 2021 18:59:34 GMT+0000 (UTC)

arXiv
