Data Efficient Language-supervised Zero-shot Recognition with Optimal Transport Distillation

Bichen Wu; Ruizhe Cheng; Peizhao Zhang; Peter Vajda; Joseph E. Gonzalez

Traditional computer vision models are trained to predict a fixed set of predefined categories. Recently, natural language has been shown to be a broader and richer source of supervision that provides finer descriptions of visual concepts than supervised "gold" labels. Previous works, such as CLIP, use the InfoNCE loss to train a model to predict the pairing between images and text captions. CLIP, however, is data hungry and requires more than 400M image-text pairs for training. This inefficiency can be partially attributed to the fact that the image-text pairs are noisy. To address this, we propose OTTER (Optimal TransporT distillation for Efficient zero-shot Recognition), which uses online entropic optimal transport to find a soft image-text match as labels for contrastive learning. Based on pretrained image and text encoders, models trained with OTTER achieve strong performance with only 3M image-text pairs. Compared with InfoNCE loss, label smoothing, and knowledge distillation, OTTER consistently outperforms these baselines in zero-shot evaluation on Google Open Images (19,958 classes) and multi-labeled ImageNet 10K (10,032 classes) from Tencent ML-Images. Across 42 evaluations (7 dataset/architecture settings × 6 metrics), OTTER outperforms (32) or ties (2) all baselines in 34.
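The core idea in the abstract, replacing the hard one-to-one pairing of InfoNCE with soft image-text matches found by entropic optimal transport, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the uniform marginals, the regularization strength `reg`, the temperature, and the mixing weight `alpha` are all assumptions chosen for illustration, and the similarity matrix here comes directly from the batch embeddings rather than from any particular teacher model.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def log_softmax(z, axis):
    """Numerically stable log-softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def sinkhorn_plan(sim, reg=0.1, n_iters=50):
    """Entropic optimal transport with cost = -sim and uniform marginals.

    Assumes a square batch (n images, n captions). Returns the transport
    plan scaled by n, so each column sums to 1 and each row sums to
    approximately 1 once the iterations have converged.
    """
    n, m = sim.shape
    K = np.exp((sim - sim.max()) / reg)  # Gibbs kernel, shifted for stability
    a = np.full(n, 1.0 / n)              # uniform mass over images
    b = np.full(m, 1.0 / m)              # uniform mass over captions
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):             # Sinkhorn scaling iterations
        u = a / (K @ v)
        v = b / (K.T @ u)
    plan = u[:, None] * K * v[None, :]
    return n * plan

def soft_contrastive_loss(img_emb, txt_emb, targets, temperature=0.07):
    """Symmetric cross-entropy between similarity logits and soft targets.

    With targets = identity this reduces to the usual InfoNCE loss.
    """
    logits = l2_normalize(img_emb) @ l2_normalize(txt_emb).T / temperature
    loss_i2t = -(targets * log_softmax(logits, axis=1)).sum(axis=1).mean()
    loss_t2i = -(targets.T * log_softmax(logits.T, axis=1)).sum(axis=1).mean()
    return 0.5 * (loss_i2t + loss_t2i)

# Toy batch: 4 images and 4 loosely matching captions (random stand-ins).
rng = np.random.default_rng(0)
img_emb = rng.normal(size=(4, 8))
txt_emb = img_emb + 0.1 * rng.normal(size=(4, 8))

sim = l2_normalize(img_emb) @ l2_normalize(txt_emb).T
soft_match = sinkhorn_plan(sim)

alpha = 0.5  # hypothetical weight between hard pairing and soft match
targets = alpha * np.eye(4) + (1.0 - alpha) * soft_match

print("InfoNCE loss:    ", soft_contrastive_loss(img_emb, txt_emb, np.eye(4)))
print("Soft-target loss:", soft_contrastive_loss(img_emb, txt_emb, targets))
```

In this sketch the soft targets spread some probability mass onto non-paired captions that are similar to an image, so a noisy batch does not penalize the model for matching an image to a plausible but unpaired caption as strongly as hard labels would.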

updated: Fri Dec 17 2021 11:27:26 GMT+0000 (UTC)

published: Fri Dec 17 2021 11:27:26 GMT+0000 (UTC)

arXiv
