Word2Pix: Word to Pixel Cross Attention Transformer in Visual Grounding

Heng Zhao; Joey Tianyi Zhou; Yew-Soon Ong

Current one-stage methods for visual grounding encode the language query as a single holistic sentence embedding before fusing it with visual features. Such a formulation does not treat each word of the query sentence equally when modeling language-to-vision attention, and is therefore prone to neglecting words that are less important for the sentence embedding but critical for visual grounding. In this paper we propose Word2Pix, a one-stage visual grounding network based on an encoder-decoder transformer architecture that learns textual-to-visual feature correspondence via word-to-pixel attention. The embedding of each word in the query sentence is treated alike: each attends to visual pixels individually, instead of the query being collapsed into a single holistic sentence embedding. In this way, each word is given an equal opportunity to adjust the language-to-vision attention towards the referent target through multiple stacked transformer decoder layers. We conduct experiments on the RefCOCO, RefCOCO+ and RefCOCOg datasets, and the proposed Word2Pix outperforms existing one-stage methods by a notable margin. The results also show that Word2Pix surpasses two-stage visual grounding models while keeping intact the merits of the one-stage paradigm, namely end-to-end training and real-time inference speed.
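
The word-to-pixel attention described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes plain single-head scaled dot-product cross-attention, where each word embedding acts as a query and each flattened visual pixel feature acts as both key and value. It omits the learned projections, multiple heads, positional encodings and stacked decoder layers a real transformer decoder would use. The function name `word_to_pixel_attention` and the toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def word_to_pixel_attention(word_embs, pixel_feats):
    """Illustrative single-head cross-attention (assumed form, no learned weights).

    word_embs:   list of N word vectors (queries), each of length d
    pixel_feats: list of P pixel vectors (keys and values), each of length d
    Returns (attended, weights): N attended vectors and an N x P weight matrix.
    """
    d = len(word_embs[0])
    scale = 1.0 / math.sqrt(d)
    attended, weights = [], []
    for w in word_embs:
        # Every word scores every pixel on its own; no sentence-level pooling.
        scores = [scale * sum(wi * pi for wi, pi in zip(w, p)) for p in pixel_feats]
        attn = softmax(scores)
        # Weighted sum of pixel features for this word.
        out = [sum(a * p[k] for a, p in zip(attn, pixel_feats)) for k in range(d)]
        weights.append(attn)
        attended.append(out)
    return attended, weights


# Toy data: 3 "words" and 4 "pixels" in a 2-dimensional feature space.
words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pixels = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0], [1.0, 1.0]]
attended, weights = word_to_pixel_attention(words, pixels)
```

Because each word produces its own row of attention weights over the pixels, a word that would contribute little to a pooled sentence embedding still gets its own attention distribution over the image, which is the property the abstract emphasizes.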

updated: Sat Jul 31 2021 10:20:15 GMT+0000 (UTC)

published: Sat Jul 31 2021 10:20:15 GMT+0000 (UTC)

arXiv
