Robust Cross-Modal Representation Learning with Progressive Self-Distillation

Alex Andonian; Shixing Chen; Raffay Hamid

The learning objective of CLIP's vision-language approach does not effectively account for the noisy many-to-many correspondences found in web-harvested image captioning datasets, which contributes to its compute and data inefficiency. To address this challenge, we introduce a novel training framework based on cross-modal contrastive learning that uses progressive self-distillation and soft image-text alignments to learn robust representations from noisy data more efficiently. Our model distills its own knowledge to dynamically generate soft-alignment targets for a subset of images and captions in every minibatch, which are then used to update its parameters. Extensive evaluation across 14 benchmark datasets shows that our method consistently outperforms its CLIP counterpart in multiple settings, including (a) zero-shot classification, (b) linear probe transfer, and (c) image-text retrieval, without incurring added computational cost. Analysis using an ImageNet-based robustness testbed reveals that our method offers better effective robustness to natural distribution shifts than both ImageNet-trained models and CLIP itself. Lastly, pretraining with datasets spanning two orders of magnitude in size shows that our improvements over CLIP tend to scale with the number of training examples.
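The idea of soft-alignment targets can be illustrated with a minimal sketch. The abstract does not specify the exact formulation, so everything below is an assumption for illustration only: the targets blend one-hot (identity) labels with the model's own softmax predictions, the mixing weight `alpha` follows a hypothetical linear ramp (`alpha_schedule`), and the blend is applied to the whole minibatch rather than to a subset. All function names and default values are hypothetical and are not taken from the paper.

```python
import math


def log_softmax(row, temperature):
    """Numerically stable log-softmax of one row of similarity scores."""
    scaled = [x / temperature for x in row]
    m = max(scaled)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_norm for s in scaled]


def soft_targets(sim, alpha, temperature=0.1):
    """Blend one-hot targets with the model's own predictions.

    Assumption: target = (1 - alpha) * identity + alpha * softmax(sim).
    In a real training loop the predictions used here would be treated
    as constants (no gradient flows through the targets).
    """
    targets = []
    for i, row in enumerate(sim):
        probs = [math.exp(lp) for lp in log_softmax(row, temperature)]
        targets.append([
            (1.0 - alpha) * (1.0 if i == j else 0.0) + alpha * p
            for j, p in enumerate(probs)
        ])
    return targets


def soft_cross_entropy(sim, targets, temperature=0.1):
    """Mean cross-entropy between soft targets and row-wise predictions."""
    total = 0.0
    for row, t_row in zip(sim, targets):
        log_probs = log_softmax(row, temperature)
        total -= sum(t * lp for t, lp in zip(t_row, log_probs))
    return total / len(sim)


def symmetric_loss(sim, alpha, temperature=0.1):
    """Average of image-to-text and text-to-image soft contrastive losses."""
    sim_t = [list(col) for col in zip(*sim)]  # transpose: captions as rows
    i2t = soft_cross_entropy(sim, soft_targets(sim, alpha, temperature), temperature)
    t2i = soft_cross_entropy(sim_t, soft_targets(sim_t, alpha, temperature), temperature)
    return 0.5 * (i2t + t2i)


def alpha_schedule(step, total_steps, alpha_max=0.5):
    """Hypothetical linear ramp: rely more on self-distilled targets over time."""
    return alpha_max * min(1.0, step / total_steps)


# Toy 3x3 image-caption cosine similarities; image 0 and caption 2 are
# also a plausible match, which hard one-hot targets would penalize.
sim = [[0.90, 0.10, 0.80],
       [0.20, 0.70, 0.10],
       [0.85, 0.00, 0.90]]
loss_early = symmetric_loss(sim, alpha_schedule(0, 100))
loss_late = symmetric_loss(sim, alpha_schedule(100, 100))
```

With `alpha = 0` the targets reduce to the usual one-hot labels of a CLIP-style contrastive loss. As `alpha` grows, off-diagonal pairs that the model already scores highly receive part of the target mass, which is one way to soften the penalty on many-to-many correspondences.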

updated: Sun Apr 10 2022 03:28:18 GMT+0000 (UTC)

published: Sun Apr 10 2022 03:28:18 GMT+0000 (UTC)

arXiv
