SeqTR: A Simple yet Universal Network for Visual Grounding

Chaoyang Zhu; Yiyi Zhou; Yunhang Shen; Gen Luo; Xingjia Pan; Mingbao Lin; Chao Chen; Liujuan Cao; Xiaoshuai Sun; Rongrong Ji

In this paper, we propose SeqTR, a simple yet universal network for visual grounding tasks such as phrase localization, referring expression comprehension (REC), and referring expression segmentation (RES). The canonical paradigms for visual grounding often require substantial expertise in designing network architectures and loss functions, which makes them hard to generalize across tasks. To simplify and unify the modeling, we cast visual grounding as a point prediction problem conditioned on image and text inputs, where either the bounding box or the binary mask is represented as a sequence of discrete coordinate tokens. Under this paradigm, visual grounding tasks are unified in our SeqTR network without task-specific branches or heads (e.g., the convolutional mask decoder used for RES), which greatly reduces the complexity of multi-task modeling. SeqTR also shares the same optimization objective across all tasks, a simple cross-entropy loss, which removes the need to design and tune hand-crafted loss functions. Experiments on five benchmark datasets demonstrate that the proposed SeqTR outperforms, or is on par with, existing state-of-the-art methods, showing that a simple yet universal approach to visual grounding is indeed feasible. Source code is available at https://github.com/sean-zhuh/SeqTR.
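The idea of representing a bounding box or a mask as a sequence of discrete coordinate tokens can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the bin count of 1000, and the uniform quantization scheme are all assumptions made for illustration, since the abstract does not specify them. A box contributes two points (top-left and bottom-right), while a mask would contribute a longer list of contour points; both go through the same quantization.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def quantize_points(points: Sequence[Point], width: int, height: int,
                    num_bins: int = 1000) -> List[int]:
    """Map (x, y) pixel coordinates to a flat list of integer tokens.

    num_bins is a hypothetical vocabulary size, not a value from the paper.
    """
    tokens: List[int] = []
    for x, y in points:
        # Scale each coordinate to [0, num_bins) and clamp the right edge.
        tx = min(int(x * num_bins / width), num_bins - 1)
        ty = min(int(y * num_bins / height), num_bins - 1)
        tokens.extend([tx, ty])
    return tokens


def dequantize_tokens(tokens: Sequence[int], width: int, height: int,
                      num_bins: int = 1000) -> List[Point]:
    """Recover approximate pixel coordinates from tokens (bin centers)."""
    points: List[Point] = []
    for i in range(0, len(tokens), 2):
        x = (tokens[i] + 0.5) * width / num_bins
        y = (tokens[i + 1] + 0.5) * height / num_bins
        points.append((x, y))
    return points


def box_to_tokens(box: Tuple[float, float, float, float], width: int,
                  height: int, num_bins: int = 1000) -> List[int]:
    """Encode a box (x1, y1, x2, y2) as four tokens via its two corners."""
    x1, y1, x2, y2 = box
    return quantize_points([(x1, y1), (x2, y2)], width, height, num_bins)


if __name__ == "__main__":
    box_tokens = box_to_tokens((64, 48, 320, 240), width=640, height=480)
    print(box_tokens)
    print(dequantize_tokens(box_tokens, width=640, height=480))
```

Because every target is a token from the same fixed vocabulary, a sequence model can be trained on boxes and masks alike with a per-token cross-entropy loss, which is the unification the abstract describes. The quantization error in this sketch is bounded by one bin width per coordinate.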

updated: Sun Jul 24 2022 02:13:37 GMT+0000 (UTC)

published: Wed Mar 30 2022 12:52:46 GMT+0000 (UTC)

arXiv
