VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts

Longtian Qiu; Renrui Zhang; Ziyu Guo; Ziyao Zeng; Yafeng Li; Guangnan Zhang

Contrastive Language-Image Pre-training (CLIP) has recently drawn increasing attention for its transferable visual representation learning. However, due to the semantic gap within datasets, CLIP's pre-trained image-text alignment becomes sub-optimal on downstream tasks, which severely harms its transfer performance. To better adapt the cross-modality embedding space, we propose VT-CLIP, which enhances CLIP via Visual-guided Texts. Specifically, we guide the textual features of different categories to adaptively explore informative regions on the image and to aggregate visual features through attention mechanisms. In this way, the texts become visual-guided, that is, more semantically correlated with downstream images, which greatly benefits the category-wise matching process. We evaluate VT-CLIP on 11 well-known classification datasets in few-shot settings to demonstrate its effectiveness.
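
The visual-guided text idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes single-head scaled dot-product attention with no learned projections, a residual connection from the original text features, and a mean-pooled global image feature. The names `visual_guided_texts`, `classify`, the feature sizes, and the temperature of 100 are all hypothetical choices made for illustration. Category text features act as queries, spatial image features act as keys and values, and the resulting text features are matched against the global image feature by cosine similarity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def l2_normalize(x, axis=-1, eps=1e-12):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def visual_guided_texts(text_feats, region_feats):
    """Let each category's text feature attend over image regions.

    text_feats:   (K, D) one feature per category (queries)
    region_feats: (N, D) one feature per image region (keys and values)
    Returns the guided text features (K, D) and the attention map (K, N).
    Assumption: no learned projection matrices, single attention head.
    """
    d = text_feats.shape[-1]
    scores = text_feats @ region_feats.T / np.sqrt(d)  # (K, N)
    attn = softmax(scores, axis=-1)                    # each row sums to 1
    aggregated = attn @ region_feats                   # (K, D)
    # Assumed residual connection: keep the original text semantics.
    return l2_normalize(text_feats + aggregated), attn

def classify(global_feat, guided_texts, temperature=100.0):
    # Category-wise matching: scaled cosine similarity, then softmax.
    logits = temperature * l2_normalize(global_feat) @ guided_texts.T
    return softmax(logits)

# Random stand-ins for CLIP encoder outputs (hypothetical sizes).
rng = np.random.default_rng(0)
K, N, D = 3, 49, 16
text_feats = l2_normalize(rng.normal(size=(K, D)))
region_feats = l2_normalize(rng.normal(size=(N, D)))
global_feat = region_feats.mean(axis=0)  # assumed pooling for the sketch

guided, attn = visual_guided_texts(text_feats, region_feats)
probs = classify(global_feat, guided)
```

In a real few-shot setup the attention module would carry trainable parameters fitted on the downstream images while the CLIP encoders stay frozen; the sketch only shows the data flow from text queries to image-conditioned text features.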

updated: Thu Nov 03 2022 08:23:13 GMT+0000 (UTC)

published: Sat Dec 04 2021 18:34:24 GMT+0000 (UTC)

arXiv
