Suspected Object Matters: Rethinking Model's Prediction for One-stage Visual Grounding

Yang Jiao; Zequn Jie; Jingjing Chen; Lin Ma; Yu-Gang Jiang

One-stage visual grounders have recently attracted considerable attention because they achieve accuracy comparable to two-stage grounders at significantly higher efficiency. However, inter-object relation modeling has not been well studied for one-stage grounders. Inter-object relationship modeling, though important, need not be performed among all the objects in an image, since only some of them are related to the text query and may confuse the model. We call these objects "suspected objects". Exploring relationships among suspected objects in the one-stage visual grounding paradigm is nevertheless non-trivial because of two core problems: (1) no object proposals are available as the basis on which to select suspected objects and perform relationship modeling; (2) compared with objects irrelevant to the text query, suspected objects are more confusing, as they may share similar semantics, be entangled in certain relationships, etc., and are therefore more likely to mislead the model's prediction. To address these issues, this paper proposes a Suspected Object Graph (SOG) approach that encourages selection of the correct referred object from among the suspected ones in one-stage visual grounding. Suspected objects are dynamically selected from a learned activation map as nodes, so that they adapt to the model's current discrimination ability during training. On top of the suspected objects, a Keyword-aware Node Representation module (KNR) and an Exploration by Random Connection strategy (ERC) are then both proposed within the SOG to help the model rethink its initial prediction. Extensive ablation studies and comparisons with state-of-the-art approaches on prevalent visual grounding benchmarks demonstrate the effectiveness of the proposed method.
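
The abstract does not say how suspected objects are picked from the activation map or how random connections are drawn, so the following is only a rough sketch under stated assumptions, not the authors' implementation. It assumes that suspected objects are the k highest-scoring cells of a 2D activation map, and that random connection means sampling a symmetric adjacency matrix with a fixed edge probability. The function names `select_suspected_objects` and `random_connections`, and the parameters `k`, `p`, and `seed`, are hypothetical.

```python
import numpy as np


def select_suspected_objects(activation_map, k=3):
    """Return (row, col) of the k highest-scoring cells.

    Assumption: top-k selection stands in for the paper's
    (unspecified) dynamic selection rule.
    """
    _, width = activation_map.shape
    flat = activation_map.ravel()
    # Stable sort on negated scores gives descending order.
    top = np.argsort(-flat, kind="stable")[:k]
    return [(int(i // width), int(i % width)) for i in top]


def random_connections(num_nodes, p=0.5, seed=0):
    """Sample a symmetric 0/1 adjacency matrix without self-loops.

    Assumption: independent edges with probability p stand in
    for the paper's Exploration by Random Connection strategy.
    """
    rng = np.random.default_rng(seed)
    # Sample only the strict upper triangle, then mirror it.
    upper = np.triu(rng.random((num_nodes, num_nodes)) < p, k=1)
    return (upper | upper.T).astype(np.int64)


if __name__ == "__main__":
    # Toy activation map; a real one would come from the grounder.
    amap = np.array([
        [0.10, 0.90, 0.20],
        [0.80, 0.05, 0.30],
        [0.00, 0.70, 0.40],
    ])
    nodes = select_suspected_objects(amap, k=3)
    print(nodes)  # [(0, 1), (1, 0), (2, 1)]
    print(random_connections(len(nodes)))
```

Because the selection reads the activation map produced at each training step, the chosen nodes shift as the model improves, which is one plausible reading of "adapt to the current discrimination ability of the model".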

updated: Thu Mar 10 2022 06:41:07 GMT+0000 (UTC)

published: Thu Mar 10 2022 06:41:07 GMT+0000 (UTC)

arXiv
