Exploring Predicate Visual Context in Detecting Human-Object Interactions

Frederic Z. Zhang; Yuhui Yuan; Dylan Campbell; Zhuoyao Zhong; Stephen Gould

Recently, the DETR framework has emerged as the dominant approach for human-object interaction (HOI) research. In particular, two-stage transformer-based HOI detectors are amongst the most performant and training-efficient approaches. However, these often condition HOI classification on object features that lack fine-grained contextual information, eschewing pose and orientation information in favour of visual cues about object identity and box extremities. This naturally hinders the recognition of complex or ambiguous interactions. In this work, we study these issues through visualisations and carefully designed experiments. Accordingly, we investigate how best to re-introduce image features via cross-attention. With an improved query design, extensive exploration of keys and values, and box pair positional embeddings as spatial guidance, our model with enhanced predicate visual context (PViC) outperforms state-of-the-art methods on the HICO-DET and V-COCO benchmarks, while maintaining low training cost.

updated: Fri Aug 11 2023 15:57:45 GMT+0000 (UTC)

published: Fri Aug 11 2023 15:57:45 GMT+0000 (UTC)

arXiv
