Decoupled Spatial Temporal Graphs for Generic Visual Grounding

Qianyu Feng; Yunchao Wei; Mingming Cheng; Yi Yang

Visual grounding is a long-standing problem in vision-language understanding due to its diversity and complexity. Current practice concentrates mostly on performing visual grounding in still images or well-trimmed video clips. This work instead investigates a more general setting, generic visual grounding, which aims to mine all the objects satisfying a given expression and is more challenging yet more practical in real-world scenarios. Importantly, grounding results are expected to accurately localize targets in both space and time. However, it is tricky to make trade-offs between appearance and motion features, and in real scenarios models tend to fail to distinguish distractors with similar attributes. Motivated by these considerations, we propose a simple yet effective approach, named DSTG, which commits to 1) decomposing the spatial and temporal representations to collect all-sided cues for precise grounding; 2) enhancing the discriminativeness against distractors and the temporal consistency with a contrastive learning routing strategy. We further build a new video dataset, GVG, that consists of challenging referring cases with far-ranging videos. Experiments demonstrate the superiority of DSTG over state-of-the-art methods on the Charades-STA, ActivityNet-Caption and GVG datasets. Code and dataset will be made available.
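
The abstract does not spell out the contrastive objective that DSTG uses, so the following is only a generic illustration of how a contrastive loss can separate a target from distractors with similar attributes. It is a minimal InfoNCE-style sketch in plain Python, not the authors' implementation. The function name `info_nce`, the `temperature` value, and the toy two-dimensional embeddings are all assumptions made for this example.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a non-zero vector to unit length (assumes v is not all zeros)."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def info_nce(query, positive, distractors, temperature=0.1):
    """Generic InfoNCE-style loss (hypothetical helper, not from the paper).

    query:       embedding of the language expression
    positive:    embedding of the object that matches the expression
    distractors: embeddings of other objects with similar attributes
    The loss is small when the query is more similar to the positive
    than to every distractor.
    """
    q = normalize(query)
    candidates = [normalize(positive)] + [normalize(d) for d in distractors]
    # Cosine similarities scaled by the temperature act as logits.
    logits = [dot(q, c) / temperature for c in candidates]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    # Negative log-probability of picking the positive (index 0).
    return log_sum - logits[0]


# Toy usage: the query aligns with the first object, not the second.
loss_good = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
loss_bad = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

In this toy case `loss_good` is close to zero and `loss_bad` is much larger, which is the behavior a contrastive objective relies on to push distractors away from the referred target.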

updated: Thu Mar 18 2021 11:56:29 GMT+0000 (UTC)

published: Thu Mar 18 2021 11:56:29 GMT+0000 (UTC)

arXiv
