HOICLIP: Efficient Knowledge Transfer for HOI Detection with Vision-Language Models

Shan Ning; Longtian Qiu; Yongfei Liu; Xuming He

Human-Object Interaction (HOI) detection aims to localize human-object pairs and recognize their interactions. Recently, Contrastive Language-Image Pre-training (CLIP) has shown great potential in providing interaction priors for HOI detectors via knowledge distillation. However, such approaches often rely on large-scale training data and perform poorly in few-shot and zero-shot scenarios. In this paper, we propose a novel HOI detection framework that efficiently extracts prior knowledge from CLIP and achieves better generalization. Specifically, we first introduce a novel interaction decoder that extracts informative regions from the visual feature map of CLIP via a cross-attention mechanism; these features are then fused with the detection backbone by a knowledge integration block for more accurate human-object pair detection. In addition, we leverage the prior knowledge in the CLIP text encoder to generate a classifier by embedding HOI descriptions. To distinguish fine-grained interactions, we build a verb classifier from training data via visual semantic arithmetic and a lightweight verb representation adapter. Furthermore, we propose a training-free enhancement that exploits global HOI predictions from CLIP. Extensive experiments demonstrate that our method outperforms the state of the art by a large margin in various settings, e.g., +4.04 mAP on HICO-Det. The source code is available at https://github.com/Artanic30/HOICLIP.
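
The abstract does not spell out how the verb classifier is built, so the following is only a minimal sketch of one plausible reading of "visual semantic arithmetic": a verb representation is taken as the averaged, normalized difference between an interaction-region feature and the corresponding object-region feature, and a query is classified by cosine similarity against those representations. All function names, the 3-dimensional toy features, and the verb labels are hypothetical and are not taken from the HOICLIP code; real features would come from the CLIP visual encoder.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def verb_representation(hoi_feats, obj_feats):
    """Assumed form of visual semantic arithmetic:
    average the normalized (interaction - object) differences."""
    acc = [0.0] * len(hoi_feats[0])
    for h, o in zip(hoi_feats, obj_feats):
        diff = l2_normalize([a - b for a, b in zip(h, o)])
        acc = [s + d for s, d in zip(acc, diff)]
    return l2_normalize(acc)

def classify(query, class_embeddings):
    """Return the class whose embedding has the highest cosine similarity."""
    q = l2_normalize(query)
    scores = {
        name: sum(a * b for a, b in zip(q, l2_normalize(e)))
        for name, e in class_embeddings.items()
    }
    return max(scores, key=scores.get), scores

# Toy features standing in for CLIP region embeddings (hypothetical values).
verb_classifier = {
    "ride": verb_representation(
        hoi_feats=[[1.0, 1.0, 0.0], [1.0, 0.9, 0.1]],
        obj_feats=[[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]],
    ),
    "hold": verb_representation(
        hoi_feats=[[1.0, 0.0, 1.0]],
        obj_feats=[[1.0, 0.0, 0.0]],
    ),
}

label, scores = classify([0.1, 0.8, 0.1], verb_classifier)
print(label, scores)
```

In this sketch the classifier weights are computed directly from training features without gradient updates; the lightweight verb representation adapter mentioned in the abstract is not modeled here.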

updated: Wed Mar 29 2023 01:53:04 GMT+0000 (UTC)

published: Tue Mar 28 2023 07:54:54 GMT+0000 (UTC)

arXiv
