TransVG: End-to-End Visual Grounding with Transformers

Jiajun Deng; Zhengyuan Yang; Tianlang Chen; Wengang Zhou; Houqiang Li

In this paper, we present TransVG, a neat yet effective transformer-based framework for visual grounding, which addresses the task of grounding a language query to the corresponding region of an image. State-of-the-art methods, whether two-stage or one-stage, rely on complex modules with manually designed mechanisms to perform query reasoning and multi-modal fusion. However, building specific mechanisms such as query decomposition and image scene graphs into the fusion module makes these models prone to overfitting to datasets with particular scenarios, and limits the rich interaction between visual and linguistic context. To avoid this limitation, we propose to establish multi-modal correspondence with transformers, and we show empirically that complex fusion modules (e.g., modular attention networks, dynamic graphs, and multi-modal trees) can be replaced by a simple stack of transformer encoder layers that achieves higher performance. Moreover, we reformulate visual grounding as a direct coordinate regression problem, which avoids making predictions from a set of candidates (i.e., region proposals or anchor boxes). Extensive experiments on five widely used datasets show that TransVG sets a series of state-of-the-art records. We establish a benchmark for transformer-based visual grounding frameworks and will make our code available to the public.
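The two ideas in the abstract, fusing both modalities with a plain stack of encoder layers and regressing the box directly instead of scoring candidates, can be illustrated with a toy sketch. This is not the authors' implementation: the weights are random and untrained, each layer is reduced to single-head self-attention with a residual connection, and the names `ground`, `encoder_layer`, `reg`, and `head` are hypothetical. Reading the box from a dedicated regression token placed at position 0, and squashing the output into (0, 1) with a sigmoid, are assumptions that the abstract does not state.

```python
import math
import random


def matvec(weight, vec):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Random stand-in for learned parameters or extracted features."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def encoder_layer(tokens, layer):
    """Simplified layer: single-head self-attention plus a residual (no FFN, no LayerNorm)."""
    dim = len(tokens[0])
    queries = [matvec(layer["wq"], t) for t in tokens]
    keys = [matvec(layer["wk"], t) for t in tokens]
    values = [matvec(layer["wv"], t) for t in tokens]
    out = []
    for q, tok in zip(queries, tokens):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        mixed = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
        out.append([x + m for x, m in zip(tok, mixed)])
    return out


def ground(visual_tokens, text_tokens, reg_token, layers, head):
    """Fuse both modalities in one sequence, then regress four box values directly."""
    # One joint sequence, so every token can attend to both modalities.
    seq = [reg_token] + visual_tokens + text_tokens
    for layer in layers:
        seq = encoder_layer(seq, layer)
    # No proposals or anchors: the box comes straight from the token at position 0.
    return [1.0 / (1.0 + math.exp(-z)) for z in matvec(head, seq[0])]


rng = random.Random(0)
DIM = 8
visual = random_matrix(6, DIM, rng)  # stand-in for 6 image-patch features
text = random_matrix(4, DIM, rng)    # stand-in for 4 word embeddings
reg = random_matrix(1, DIM, rng)[0]  # hypothetical regression token
layers = [
    {name: random_matrix(DIM, DIM, rng) for name in ("wq", "wk", "wv")}
    for _ in range(2)
]
head = random_matrix(4, DIM, rng)    # maps DIM features to 4 box values

box = ground(visual, text, reg, layers, head)
print(box)  # four untrained values, each strictly between 0 and 1
```

Because the weights are random, the printed numbers carry no meaning. The point is structural: the fusion step contains no query decomposition, graph, or tree, and the output is a single box rather than a ranking over candidate regions.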

updated: Sat Apr 17 2021 13:35:24 GMT+0000 (UTC)

published: Sat Apr 17 2021 13:35:24 GMT+0000 (UTC)

arXiv
