OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents

Hugo Laurençon; Lucile Saulnier; Léo Tronchon; Stas Bekman; Amanpreet Singh; Anton Lozhkov; Thomas Wang; Siddharth Karamcheti; Alexander M. Rush; Douwe Kiela; Matthieu Cord; Victor Sanh

Large multimodal models trained on natural documents, which interleave images and text, outperform models trained on image-text pairs on various multimodal benchmarks. However, the datasets used to train these models have not been released, and the collection process has not been fully specified. We introduce the OBELICS dataset, an open web-scale filtered dataset of interleaved image-text documents comprising 141 million web pages extracted from Common Crawl, 353 million associated images, and 115 billion text tokens. We describe the dataset creation process, present comprehensive filtering rules, and provide an analysis of the dataset's content. To show the viability of OBELICS, we train vision and language models of 9 and 80 billion parameters named IDEFICS, and obtain competitive performance on different multimodal benchmarks. We release our dataset, models and code.

updated: Mon Aug 21 2023 09:35:52 GMT+0000 (UTC)

published: Wed Jun 21 2023 14:01:01 GMT+0000 (UTC)

arXiv
