DualCoOp++: Fast and Effective Adaptation to Multi-Label Recognition with Limited Annotations

Ping Hu; Ximeng Sun; Stan Sclaroff; Kate Saenko

Multi-label image recognition in the low-label regime is both challenging and practically significant. Previous works have focused on learning the alignment between textual and visual spaces to compensate for limited image labels, but their accuracy may suffer from the scarcity of high-quality multi-label annotations. In this work, we leverage the strong alignment between textual and visual features pretrained on millions of auxiliary image-text pairs. We introduce an efficient and effective framework called Evidence-guided Dual Context Optimization (DualCoOp++), which serves as a unified approach to partial-label and zero-shot multi-label recognition. In DualCoOp++, we separately encode evidential, positive, and negative contexts for target classes as parametric components of the linguistic input (i.e., prompts). The evidential context aims to discover all visual content related to the target class and serves as guidance for aggregating positive and negative contexts from the spatial domain of the image, enabling better discrimination between similar categories. Additionally, we introduce a Winner-Take-All module that promotes inter-class interaction during training without requiring extra parameters or cost. Because DualCoOp++ adds only minimal learnable overhead to the pretrained vision-language framework, it adapts rapidly to multi-label recognition tasks with limited annotations and even unseen classes. Experiments on standard multi-label recognition benchmarks across two challenging low-label settings demonstrate that our approach outperforms state-of-the-art methods.
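To make the evidence-guided aggregation idea concrete, here is a minimal NumPy sketch of one possible reading of the abstract. It is not the authors' implementation: the function names (`class_logits`, `winner_take_all`), the `temperature` and `penalty` parameters, the use of a softmax over regions, and the positive-minus-negative score are all assumptions made for illustration. The abstract only states that the evidential context guides spatial aggregation of the positive and negative contexts and that the Winner-Take-All module adds no parameters.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def softmax(x, axis):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def winner_take_all(evidence_sim, penalty=1.0):
    """Hypothetical parameter-free WTA: each region keeps the evidence score
    of its best-matching class; all other classes are pushed down by `penalty`."""
    winners = evidence_sim.argmax(axis=1)                  # (regions,)
    keep = np.zeros(evidence_sim.shape, dtype=bool)
    keep[np.arange(evidence_sim.shape[0]), winners] = True
    return np.where(keep, evidence_sim, evidence_sim - penalty)

def class_logits(regions, evidence, positive, negative,
                 temperature=0.1, use_wta=False):
    """Assumed aggregation: evidence similarity decides where to look,
    positive/negative similarities decide what score each region contributes.

    regions:  (R, D) spatial image features
    evidence, positive, negative: (C, D) per-class prompt embeddings
    returns:  (C,) per-class logits
    """
    regions = l2_normalize(regions)
    evidence, positive, negative = map(l2_normalize, (evidence, positive, negative))

    ev = regions @ evidence.T                              # (R, C) cosine similarity
    if use_wta:
        ev = winner_take_all(ev)
    attn = softmax(ev / temperature, axis=0)               # weights over regions, per class

    pos = regions @ positive.T                             # (R, C)
    neg = regions @ negative.T                             # (R, C)
    return (attn * (pos - neg)).sum(axis=0)                # (C,)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(49, 16))                      # e.g. a 7x7 feature map, D=16
    prompts = [rng.normal(size=(5, 16)) for _ in range(3)] # 5 classes, 3 contexts
    print(class_logits(feats, *prompts, use_wta=True).shape)
```

In this sketch the only class-specific inputs are the three prompt embeddings, which mirrors the abstract's point that adaptation happens through a small number of learnable prompt parameters while the pretrained encoders stay fixed.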

updated: Thu Dec 14 2023 02:19:42 GMT+0000 (UTC)

published: Thu Aug 03 2023 17:33:20 GMT+0000 (UTC)

arXiv
