T-MARS: Improving Visual Representations by Circumventing Text Feature Learning

Pratyush Maini; Sachin Goyal; Zachary C. Lipton; J. Zico Kolter; Aditi Raghunathan

Large web-sourced multimodal datasets have powered a slew of new methods for learning general-purpose visual representations, advancing the state of the art in computer vision and revolutionizing zero- and few-shot recognition. One crucial decision facing practitioners is how, if at all, to curate these ever-larger datasets. For example, the creators of the LAION-5B dataset chose to retain only image-caption pairs whose CLIP similarity score exceeded a designated threshold. In this paper, we propose a new state-of-the-art data filtering approach motivated by our observation that nearly 40% of LAION's images contain text that overlaps significantly with the caption. Intuitively, such data could be wasteful as it incentivizes models to perform optical character recognition rather than learning visual features. However, naively removing all such data could also be wasteful, as it throws away images that contain visual features (in addition to overlapping text). Our simple and scalable approach, T-MARS (Text Masking and Re-Scoring), filters out only those pairs where the text dominates the remaining visual features: it first masks out the text in each image and then removes pairs whose masked image has a low CLIP similarity score with the caption. Experimentally, T-MARS outperforms the top-ranked method on the "medium scale" of DataComp (a data filtering benchmark) by a margin of 6.5% on ImageNet and 4.7% on VTAB. Additionally, our systematic evaluation on various data pool sizes from 2M to 64M shows that the accuracy gains enjoyed by T-MARS increase linearly as data and compute are scaled exponentially. Code is available at https://github.com/locuslab/T-MARS.
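The mask-then-re-score idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository linked above for that). The names `mask_text`, `embed_image`, `embed_caption`, and `threshold` are hypothetical stand-ins for a text detector plus inpainting step, a CLIP image encoder, a CLIP text encoder, and a chosen cutoff. The toy "images" are plain dictionaries, and returning the original unmasked image for kept pairs is an assumption of this sketch.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]
Image = Dict[str, object]  # toy stand-in for real image data


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mask_and_rescore_filter(
    pairs: List[Tuple[Image, str]],
    mask_text: Callable[[Image], Image],
    embed_image: Callable[[Image], Vector],
    embed_caption: Callable[[str], Vector],
    threshold: float,
) -> List[Tuple[Image, str]]:
    """Keep pairs whose text-masked image still matches the caption."""
    kept = []
    for image, caption in pairs:
        masked = mask_text(image)  # hide any text rendered in the image
        score = cosine_similarity(embed_image(masked), embed_caption(caption))
        if score >= threshold:
            # Assumption: the original (unmasked) image is what gets kept.
            kept.append((image, caption))
    return kept


# --- Toy stand-ins (hypothetical; real encoders would be CLIP models) ---
def toy_mask_text(image: Image) -> Image:
    # Return a copy with the "text" component zeroed out.
    masked = dict(image)
    masked["text"] = [0.0, 0.0]
    return masked


def toy_embed_image(image: Image) -> Vector:
    # Toy embedding: visual and text components simply add up.
    return [v + t for v, t in zip(image["visual"], image["text"])]


def toy_embed_caption(caption: str) -> Vector:
    return {"a photo of a dog": [1.0, 0.0]}[caption]


toy_pairs = [
    # Visual content matches the caption, no overlaid text.
    ({"name": "photo", "visual": [1.0, 0.0], "text": [0.0, 0.0]}, "a photo of a dog"),
    # Only the overlaid text matches the caption.
    ({"name": "text_only_match", "visual": [0.0, 1.0], "text": [1.0, 0.0]}, "a photo of a dog"),
    # Both the visual content and the overlaid text match the caption.
    ({"name": "photo_with_text", "visual": [1.0, 0.0], "text": [1.0, 0.0]}, "a photo of a dog"),
]

kept = mask_and_rescore_filter(
    toy_pairs, toy_mask_text, toy_embed_image, toy_embed_caption, threshold=0.3
)
kept_names = [image["name"] for image, _ in kept]
```

In this toy setup, the pair whose caption is matched only by the overlaid text is dropped, while the pair that contains both matching text and matching visual content survives. That is the distinction the abstract draws between T-MARS and naively removing every image that contains text.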

updated: Thu Jul 06 2023 16:59:52 GMT+0000 (UTC)

published: Thu Jul 06 2023 16:59:52 GMT+0000 (UTC)

arXiv
