Open-Vocabulary Multi-Label Classification via Multi-modal Knowledge Transfer

Sunan He; Taian Guo; Tao Dai; Ruizhi Qiao; Bo Ren; Shu-Tao Xia

Real-world recognition systems often encounter many unseen labels in practice. To identify such unseen labels, multi-label zero-shot learning (ML-ZSL) focuses on transferring knowledge through pre-trained textual label embeddings (e.g., GloVe). However, such methods exploit only single-modal knowledge from a language model and ignore the rich semantic information inherent in image-text pairs. In contrast, recently developed open-vocabulary (OV) based methods succeed in exploiting such information from image-text pairs in object detection and achieve impressive performance. Inspired by the success of OV-based methods, we propose a novel open-vocabulary framework, named multi-modal knowledge transfer (MKT), for multi-label classification. Specifically, our method exploits multi-modal knowledge of image-text pairs based on a vision and language pre-training (VLP) model. To facilitate transferring the image-text matching ability of the VLP model, knowledge distillation is used to guarantee the consistency of image and label embeddings, along with prompt tuning to further update the label embeddings. To better recognize multiple objects, a simple but effective two-stream module is developed to capture both local and global features. Extensive experimental results show that our method significantly outperforms state-of-the-art methods on public benchmark datasets. Code will be available at https://github.com/seanhe97/MKT.
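The abstract names three ingredients: matching image embeddings against label embeddings, distilling from a VLP model, and a two-stream module over global and local features. The toy sketch below illustrates how these pieces could fit together. It is not the authors' implementation. The function names, the top-k averaging of patch scores, the equal weighting of the two streams, the L1 form of the distillation loss, and the hand-written 2-D embeddings are all assumptions made for illustration only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def two_stream_scores(global_emb, patch_embs, label_embs, k=2):
    """Score every label, seen or unseen, against one image.

    Assumed design: the global stream compares a single image embedding
    with the label embedding; the local stream averages the k best
    patch-to-label similarities; the two streams are averaged.
    """
    scores = {}
    for label, text_emb in label_embs.items():
        global_score = cosine(global_emb, text_emb)
        patch_scores = sorted((cosine(p, text_emb) for p in patch_embs), reverse=True)
        top = patch_scores[:k]
        local_score = sum(top) / len(top) if top else 0.0
        scores[label] = 0.5 * (global_score + local_score)
    return scores

def distillation_l1(student_emb, teacher_emb):
    """Assumed distillation loss: mean absolute difference between the
    student's image embedding and the (frozen) VLP teacher's embedding."""
    return sum(abs(s - t) for s, t in zip(student_emb, teacher_emb)) / len(student_emb)

# Hypothetical 2-D embeddings; a real system would obtain label embeddings
# from the VLP text encoder, so new labels can be added without retraining.
label_embs = {"dog": [1.0, 0.0], "car": [0.0, 1.0], "zebra": [-1.0, 0.0]}
global_emb = [1.0, 1.0]
patch_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]]

scores = two_stream_scores(global_emb, patch_embs, label_embs, k=1)
predicted = sorted(label for label, s in scores.items() if s > 0.5)
print(predicted)
```

Because labels enter only through their embeddings, scoring an unseen label in this sketch just means adding another entry to `label_embs`, which is the open-vocabulary property the abstract relies on.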

updated: Tue Jul 05 2022 08:32:18 GMT+0000 (UTC)

published: Tue Jul 05 2022 08:32:18 GMT+0000 (UTC)

arXiv
