Learning Visual Representations via Language-Guided Sampling

Mohamed El Banani; Karan Desai; Justin Johnson

Although an object may appear in numerous contexts, we often describe it in a limited number of ways. Language allows us to abstract away visual variation to represent and communicate concepts. Building on this intuition, we propose an alternative approach to visual representation learning: using language similarity to sample semantically similar image pairs for contrastive learning. Our approach diverges from image-based contrastive learning by sampling view pairs using language similarity instead of hand-crafted augmentations or learned clusters. Our approach also differs from image-text contrastive learning by relying on pre-trained language models to guide the learning rather than directly minimizing a cross-modal loss. Through a series of experiments, we show that language-guided learning yields better features than image-based and image-text representation learning approaches.
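The sampling idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes caption embeddings have already been computed by some pre-trained language model, picks each image's positive as the image whose caption is most similar, and scores the resulting pairs with a generic InfoNCE-style contrastive loss. The function names `sample_language_guided_pairs` and `info_nce_loss`, the temperature value, and the toy data are all hypothetical choices made for illustration.

```python
import numpy as np


def sample_language_guided_pairs(caption_embeddings: np.ndarray) -> np.ndarray:
    """For each caption, return the index of its most similar *other* caption.

    Assumption: row i is the embedding of the caption of image i, produced by
    a pre-trained language model (not shown here).
    """
    # Normalize rows so that dot products equal cosine similarities.
    norms = np.linalg.norm(caption_embeddings, axis=1, keepdims=True)
    unit = caption_embeddings / np.clip(norms, 1e-12, None)
    similarity = unit @ unit.T
    # Exclude self-matches so an image is never paired with itself.
    np.fill_diagonal(similarity, -np.inf)
    return similarity.argmax(axis=1)


def info_nce_loss(anchors: np.ndarray, positives: np.ndarray,
                  temperature: float = 0.1) -> float:
    """Generic InfoNCE loss: row i of `positives` is the positive for row i
    of `anchors`; every other row in the batch acts as a negative."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = (a @ p.T) / temperature
    # Subtract the row maximum for numerical stability before the softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


# Toy data: captions 0 and 1 are near-duplicates, as are captions 2 and 3.
toy_captions = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
toy_pairs = sample_language_guided_pairs(toy_captions)

# Stand-in image features; a real setup would use a trainable image encoder.
rng = np.random.default_rng(0)
toy_image_features = rng.normal(size=(4, 8))
toy_loss = info_nce_loss(toy_image_features, toy_image_features[toy_pairs])
```

In this sketch the language model only decides which images count as a positive pair; the loss itself is computed purely on image features, which mirrors the distinction the abstract draws from image-text contrastive learning. Because nearest-caption positives can repeat within a batch, some in-batch negatives may actually be semantically close, which a real training setup would need to handle.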

updated: Wed Mar 29 2023 10:23:40 GMT+0000 (UTC)

published: Thu Feb 23 2023 18:59:05 GMT+0000 (UTC)

arXiv
