Filtering, Distillation, and Hard Negatives for Vision-Language Pre-Training

Filip Radenovic; Abhimanyu Dubey; Abhishek Kadian; Todor Mihaylov; Simon Vandenhende; Yash Patel; Yi Wen; Vignesh Ramanathan; Dhruv Mahajan

Vision-language models trained with contrastive learning on large-scale noisy data are becoming increasingly popular for zero-shot recognition problems. In this paper, we improve three aspects of the contrastive pre-training pipeline: dataset noise, model initialization, and the training objective. First, we propose a straightforward filtering strategy called Complexity, Action, and Text-spotting (CAT) that significantly reduces dataset size while improving performance across zero-shot vision-language tasks. Next, we propose an approach called Concept Distillation, which leverages strong unimodal representations for contrastive training; it does not increase training complexity and outperforms prior work. Finally, we modify the traditional contrastive alignment objective and propose an importance-sampling approach that increases the weight of hard negatives without adding complexity. On an extensive zero-shot benchmark of 29 tasks, our Distilled and Hard-negative Training (DiHT) approach improves on 20 tasks compared to the baseline. Furthermore, for few-shot linear probing, we propose a novel approach that bridges the gap between zero-shot and few-shot performance, substantially improving over prior work. Models are available at https://github.com/facebookresearch/diht.
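The abstract does not spell out the modified objective, so the following is only a minimal sketch of one way an importance-weighted contrastive loss could up-weight hard negatives. It is not confirmed to be the DiHT loss. The function name `hard_negative_nce`, the temperature `tau`, and the hardness exponent `beta` are all hypothetical, and only the image-to-text direction is shown. Each negative is reweighted in proportion to `exp(beta * s / tau)`, with weights normalized to average 1, so `beta = 0` recovers the standard InfoNCE loss.

```python
import math


def hard_negative_nce(sim, tau=0.1, beta=0.5):
    """Illustrative image-to-text contrastive loss with reweighted negatives.

    sim[i][j] is the similarity between image i and text j; the diagonal
    holds the matched (positive) pairs. All names here are assumptions
    made for illustration, not the paper's API.
    """
    n = len(sim)
    if n < 2:
        raise ValueError("need at least two pairs to have negatives")
    total = 0.0
    for i in range(n):
        pos = math.exp(sim[i][i] / tau)
        neg_idx = [j for j in range(n) if j != i]
        # Importance weights: harder (more similar) negatives get larger
        # weights. They are normalized so that they sum to n - 1, which
        # means every weight equals 1 when beta == 0.
        raw = [math.exp(beta * sim[i][j] / tau) for j in neg_idx]
        norm = sum(raw)
        weights = [(n - 1) * r / norm for r in raw]
        neg = sum(w * math.exp(sim[i][j] / tau)
                  for w, j in zip(weights, neg_idx))
        total += -math.log(pos / (pos + neg))
    return total / n


if __name__ == "__main__":
    sim = [
        [0.90, 0.10, 0.70],
        [0.20, 0.80, 0.00],
        [0.60, 0.30, 0.95],
    ]
    print("standard InfoNCE:", hard_negative_nce(sim, beta=0.0))
    print("hard-negative weighted:", hard_negative_nce(sim, beta=0.5))
```

Because the weights only rescale terms that the standard loss already computes, this kind of reweighting needs no extra forward passes or negative mining, which is consistent with the abstract's claim of no added complexity.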

updated: Thu Jan 05 2023 19:48:01 GMT+0000 (UTC)

published: Thu Jan 05 2023 19:48:01 GMT+0000 (UTC)

arXiv
