Unified Visual Relationship Detection with Vision and Language Models

Long Zhao; Liangzhe Yuan; Boqing Gong; Yin Cui; Florian Schroff; Ming-Hsuan Yang; Hartwig Adam; Ting Liu

This work focuses on training a single visual relationship detector predicting over the union of label spaces from multiple datasets. Merging labels spanning different datasets could be challenging due to inconsistent taxonomies. The issue is exacerbated in visual relationship detection when second-order visual semantics are introduced between pairs of objects. To address this challenge, we propose UniVRD, a novel bottom-up method for Unified Visual Relationship Detection by leveraging vision and language models (VLMs). VLMs provide well-aligned image and text embeddings, where similar relationships are optimized to be close to each other for semantic unification. Our bottom-up design enables the model to enjoy the benefit of training with both object detection and visual relationship datasets. Empirical results on both human-object interaction detection and scene-graph generation demonstrate the competitive performance of our model. UniVRD achieves 38.07 mAP on HICO-DET, outperforming the current best bottom-up HOI detector by 14.26 mAP. More importantly, we show that our unified detector performs as well as dataset-specific models in mAP, and achieves further improvements when we scale up the model. Our code will be made publicly available on GitHub.
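
The idea of unifying label spaces through a shared embedding space can be illustrated with a small sketch. This is not the UniVRD implementation: the character n-gram function `toy_text_embedding` is a stand-in assumed here for a VLM text encoder, the two label lists are hypothetical, and in the actual model the query would be a relationship embedding computed from an image rather than from text. The sketch only shows how labels from different datasets can be scored jointly by cosine similarity over the union of their label spaces.

```python
import math
from collections import Counter


def toy_text_embedding(text, n=3):
    """Stand-in for a VLM text encoder (assumption, not the paper's model).

    Maps a string to L2-normalised character n-gram counts, stored sparsely.
    """
    padded = f" {text.lower()} "
    counts = Counter(padded[i:i + n] for i in range(len(padded) - n + 1))
    norm = math.sqrt(sum(v * v for v in counts.values()))
    return {gram: v / norm for gram, v in counts.items()}


def cosine(a, b):
    """Cosine similarity of two already-normalised sparse vectors."""
    return sum(weight * b.get(gram, 0.0) for gram, weight in a.items())


# Hypothetical relationship labels from two datasets with different taxonomies.
dataset_a_labels = ["person riding horse", "person holding cup"]
dataset_b_labels = ["man rides horse", "cup on table"]

# The unified label space is simply the union; no manual label merging.
unified_labels = sorted(set(dataset_a_labels) | set(dataset_b_labels))
label_embeddings = {label: toy_text_embedding(label) for label in unified_labels}


def classify(query_embedding):
    """Score a query embedding against every label in the unified space."""
    scores = {
        label: cosine(query_embedding, embedding)
        for label, embedding in label_embeddings.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores


# In a real detector the query would come from the image branch; text is
# used here only so that the sketch is self-contained.
query = toy_text_embedding("person riding horse")
best_label, scores = classify(query)
```

Under this toy encoder, "person riding horse" scores higher against "man rides horse" than against "cup on table", which mirrors the intended effect of similar relationships lying close together regardless of which dataset defined them.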

updated: Mon Aug 21 2023 01:04:50 GMT+0000 (UTC)

published: Thu Mar 16 2023 00:06:28 GMT+0000 (UTC)

arXiv
