CLOOB: Modern Hopfield Networks with InfoLOOB Outperform CLIP

Andreas Fürst; Elisabeth Rumetshofer; Johannes Lehner; Viet Tran; Fei Tang; Hubert Ramsauer; David Kreil; Michael Kopp; Günter Klambauer; Angela Bitto-Nemling; Sepp Hochreiter



CLIP yielded impressive results on zero-shot transfer learning tasks and is considered a foundation model like BERT or GPT3. CLIP vision models, which have a rich representation, are pre-trained using the InfoNCE objective and natural language supervision before they are fine-tuned on particular tasks. Though CLIP excels at zero-shot transfer learning, it suffers from an explaining away problem: it focuses on one or a few features while neglecting other relevant features. This problem is caused by insufficiently extracting the covariance structure in the original multi-modal data. We suggest using modern Hopfield networks to tackle the problem of explaining away. Their retrieved embeddings have an enriched covariance structure derived from co-occurrences of features in the stored embeddings. However, modern Hopfield networks increase the saturation effect of the InfoNCE objective, which hampers learning. We propose using the InfoLOOB objective to mitigate this saturation effect. We introduce the novel "Contrastive Leave One Out Boost" (CLOOB), which uses modern Hopfield networks for covariance enrichment together with the InfoLOOB objective. In experiments, we compare CLOOB to CLIP after pre-training on the Conceptual Captions and the YFCC datasets with respect to their zero-shot transfer learning performance on other datasets. CLOOB consistently outperforms CLIP at zero-shot transfer learning across all considered architectures and datasets.
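The two ingredients named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy embeddings, the helper names (`hopfield_retrieve`, `info_nce`, `info_loob`), and the values of `beta` and `tau` are assumptions chosen for illustration. Retrieval is written as the softmax-weighted average of stored patterns used by modern Hopfield networks, and InfoLOOB is written as InfoNCE with the positive pair left out of the denominator, which is the "leave one out" idea.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def log_sum_exp(values):
    # Numerically stable log(sum(exp(v))).
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def hopfield_retrieve(query, stored, beta):
    """Softmax(beta * <query, stored_k>)-weighted average of stored patterns."""
    scores = [beta * dot(query, s) for s in stored]
    lse = log_sum_exp(scores)
    weights = [math.exp(s - lse) for s in scores]
    dim = len(stored[0])
    retrieved = [sum(w * s[d] for w, s in zip(weights, stored)) for d in range(dim)]
    return normalize(retrieved)


def info_nce(anchors, samples, tau):
    """Positive pair appears in the denominator together with the negatives."""
    n = len(anchors)
    loss = 0.0
    for i in range(n):
        logits = [dot(anchors[i], samples[j]) / tau for j in range(n)]
        loss -= logits[i] - log_sum_exp(logits)
    return loss / n


def info_loob(anchors, samples, tau):
    """Leave-one-out variant: the denominator holds only the negatives."""
    n = len(anchors)
    loss = 0.0
    for i in range(n):
        logits = [dot(anchors[i], samples[j]) / tau for j in range(n)]
        negatives = logits[:i] + logits[i + 1:]
        loss -= logits[i] - log_sum_exp(negatives)
    return loss / n


# Hypothetical toy batch: three paired image and text embeddings.
images = [normalize(v) for v in [[1.0, 0.2, 0.0], [0.1, 1.0, 0.3], [0.0, 0.3, 1.0]]]
texts = [normalize(v) for v in [[0.9, 0.1, 0.1], [0.2, 0.8, 0.2], [0.1, 0.2, 0.9]]]

# Both modalities query the same store of image embeddings (an assumption
# about how retrieval is wired up; beta and tau are illustrative values).
beta, tau = 8.0, 0.3
image_retrieved = [hopfield_retrieve(x, images, beta) for x in images]
text_retrieved = [hopfield_retrieve(y, images, beta) for y in texts]

print("InfoNCE :", info_nce(image_retrieved, text_retrieved, tau))
print("InfoLOOB:", info_loob(image_retrieved, text_retrieved, tau))
```

Because the positive pair is absent from the InfoLOOB denominator, a well-matched pair cannot push the ratio toward 1 on its own. This is one way to read the abstract's claim that InfoLOOB mitigates the saturation seen with InfoNCE.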

updated: Mon Jun 13 2022 06:54:47 GMT+0000 (UTC)

published: Thu Oct 21 2021 17:50:48 GMT+0000 (UTC)

arXiv
