Referring Transformer: A One-step Approach to Multi-task Visual Grounding

Muchen Li; Leonid Sigal

As an important step towards visual reasoning, visual grounding (e.g., phrase localization, referring expression comprehension/segmentation) has been widely explored. Previous approaches to referring expression comprehension (REC) or segmentation (RES) either suffer from limited performance, due to a two-stage setup, or require designing complex task-specific one-stage architectures. In this paper, we propose a simple one-stage multi-task framework for visual grounding tasks. Specifically, we leverage a transformer architecture, where two modalities are fused in a visual-lingual encoder. In the decoder, the model learns to generate contextualized lingual queries, which are then decoded and used to directly regress the bounding box and produce a segmentation mask for the corresponding referred regions. With this simple but highly contextualized model, we outperform state-of-the-art methods by a large margin on both REC and RES tasks. We also show that a simple pre-training schedule (on an external dataset) further improves the performance. Extensive experiments and ablations illustrate that our model benefits greatly from contextualized information and multi-task training.
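
The abstract's last decoding step, where one decoded query feeds both a box regressor and a mask predictor, can be illustrated with a toy sketch in plain Python. This is not the authors' implementation: the function names (`box_head`, `mask_head`, `to_corners`), the sigmoid-normalized (cx, cy, w, h) box format, and the dot-product mask scoring are all assumptions chosen for illustration, and the weights are hand-set rather than learned.

```python
import math


def sigmoid(x):
    """Logistic function, used to squash scores into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def box_head(query, weights, bias):
    """Hypothetical linear box head (an assumption, not the paper's layer).

    Maps one decoded query vector to a normalized (cx, cy, w, h) box by
    applying a linear layer followed by a sigmoid.
    """
    return tuple(
        sigmoid(sum(w * q for w, q in zip(row, query)) + b)
        for row, b in zip(weights, bias)
    )


def mask_head(query, pixel_features, threshold=0.5):
    """Hypothetical mask head (an assumption, not the paper's layer).

    Scores every pixel by the dot product of its feature vector with the
    same query, then thresholds the sigmoid to get a binary mask.
    """
    return [
        [
            1 if sigmoid(sum(f * q for f, q in zip(feat, query))) > threshold else 0
            for feat in row
        ]
        for row in pixel_features
    ]


def to_corners(box, width, height):
    """Convert a normalized (cx, cy, w, h) box to pixel (x0, y0, x1, y1)."""
    cx, cy, w, h = box
    return (
        (cx - w / 2) * width,
        (cy - h / 2) * height,
        (cx + w / 2) * width,
        (cy + h / 2) * height,
    )


# Toy usage: one 2-dimensional query shared by both task heads.
query = [1.0, 0.0]
zero_weights = [[0.0, 0.0] for _ in range(4)]  # hand-set, not learned
box = box_head(query, zero_weights, [0.0, 0.0, 0.0, 0.0])
mask = mask_head(query, [[[2.0, 0.0], [-2.0, 0.0]]])
corners = to_corners(box, width=100, height=200)
```

Both heads read the same query vector, which is the sense in which the box and mask predictions are produced jointly in this sketch; a real model would learn the head parameters and operate on tensors rather than nested lists.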

updated: Wed Jul 14 2021 12:22:08 GMT+0000 (UTC)

published: Sun Jun 06 2021 10:53:39 GMT+0000 (UTC)

arXiv
