Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP

Thao Nguyen; Gabriel Ilharco; Mitchell Wortsman; Sewoong Oh; Ludwig Schmidt

Web-crawled datasets have enabled remarkable generalization capabilities in recent image-text models such as CLIP (Contrastive Language-Image pre-training) or Flamingo, but little is known about the dataset creation processes. In this work, we introduce a testbed of six publicly available data sources - YFCC, LAION, Conceptual Captions, WIT, RedCaps, Shutterstock - to investigate how pre-training distributions induce robustness in CLIP. We find that the performance of the pre-training data varies substantially across distribution shifts, with no single data source dominating. Moreover, we systematically study the interactions between these data sources and find that combining multiple sources does not necessarily yield better models, but rather dilutes the robustness of the best individual data source. We complement our empirical findings with theoretical insights from a simple setting, where combining the training data also results in diluted robustness. In addition, our theoretical model provides a candidate explanation for the success of the CLIP-based data filtering technique recently employed in the LAION dataset. Overall our results demonstrate that simply gathering a large amount of data from the web is not the most effective way to build a pre-training dataset for robust generalization, necessitating further study into dataset design. Code is available at https://github.com/mlfoundations/clip_quality_not_quantity.

updated: Fri Oct 14 2022 08:42:41 GMT+0000 (UTC)

published: Wed Aug 10 2022 18:24:23 GMT+0000 (UTC)

arXiv
