Open-Vocabulary Multi-Label Classification via Multi-Modal Knowledge Transfer

Sunan He; Taian Guo; Tao Dai; Ruizhi Qiao; Bo Ren; Shu-Tao Xia

Real-world recognition systems often encounter the challenge of unseen labels. To identify such unseen labels, multi-label zero-shot learning (ML-ZSL) focuses on transferring knowledge through pre-trained textual label embeddings (e.g., GloVe). However, such methods exploit only single-modal knowledge from a language model and ignore the rich semantic information inherent in image-text pairs. In contrast, recently developed open-vocabulary (OV) based methods succeed in exploiting this information from image-text pairs in object detection and achieve impressive performance. Inspired by the success of OV-based methods, we propose a novel open-vocabulary framework, named multi-modal knowledge transfer (MKT), for multi-label classification. Specifically, our method exploits multi-modal knowledge of image-text pairs based on a vision and language pre-training (VLP) model. To facilitate transferring the image-text matching ability of the VLP model, knowledge distillation is employed to guarantee the consistency of image and label embeddings, along with prompt tuning to further update the label embeddings. To further enable the recognition of multiple objects, a simple but effective two-stream module is developed to capture both local and global features. Extensive experimental results show that our method significantly outperforms state-of-the-art methods on public benchmark datasets. The source code is available at https://github.com/sunanhe/MKT.
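
The abstract names three ingredients: matching image embeddings against label embeddings, a two-stream module that combines global and local features, and knowledge distillation that keeps the image embedding consistent with the VLP model. The following is a minimal illustrative sketch of how these pieces could fit together, not the authors' implementation. The function names, the top-k averaging of patch similarities, the additive combination of the two streams, and the L1 form of the distillation loss are all assumptions made for illustration; the abstract does not specify them.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def two_stream_scores(global_emb, patch_embs, label_embs, top_k=2):
    """Hypothetical two-stream scoring: one score per label.

    Global stream: similarity of the whole-image embedding to the label.
    Local stream: mean of the top-k patch-to-label similarities (assumed).
    The two streams are simply added (assumed).
    """
    scores = []
    for label in label_embs:
        global_score = cosine(global_emb, label)
        patch_scores = sorted((cosine(p, label) for p in patch_embs), reverse=True)
        k = min(top_k, len(patch_scores))
        local_score = sum(patch_scores[:k]) / k if k > 0 else 0.0
        scores.append(global_score + local_score)
    return scores


def distillation_l1(student_emb, teacher_emb):
    """Assumed distillation loss: mean absolute difference between the
    student image embedding and the frozen VLP (teacher) image embedding."""
    return sum(abs(s - t) for s, t in zip(student_emb, teacher_emb)) / len(student_emb)


if __name__ == "__main__":
    # Toy 2-D embeddings; real label embeddings would come from a VLP text encoder.
    label_embs = [[1.0, 0.0], [0.0, 1.0]]
    global_emb = [1.0, 0.0]
    patch_embs = [[1.0, 0.0], [0.0, 1.0]]
    print(two_stream_scores(global_emb, patch_embs, label_embs, top_k=1))
    print(distillation_l1([1.0, 2.0], [1.0, 4.0]))
```

Because labels are scored by similarity to text-derived embeddings rather than by a fixed classifier head, an unseen label can be scored by adding its embedding to `label_embs` without retraining, which is the property that open-vocabulary methods rely on.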

updated: Wed Feb 01 2023 10:59:03 GMT+0000 (UTC)

published: Tue Jul 05 2022 08:32:18 GMT+0000 (UTC)

arXiv
