AttnGrounder: Talking to Cars with Attention

Vivek Mittal

We propose Attention Grounder (AttnGrounder), a single-stage, end-to-end trainable model for visual grounding. Visual grounding aims to localize a specific object in an image based on a given natural language text query. Unlike previous methods that use the same text representation for every image region, we use a visual-text attention module that relates each word in the given query to every region in the corresponding image, constructing a region-dependent text representation. Furthermore, to improve the localization ability of our model, we use the visual-text attention module to generate an attention mask around the referred object. The attention mask is trained as an auxiliary task using a rectangular mask generated from the provided ground-truth coordinates. We evaluate AttnGrounder on the Talk2Car dataset and show an improvement of 3.26% over existing methods.
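
The two ideas in the abstract (a per-region attention over query words, and a rectangular target mask built from the ground-truth box) can be illustrated with a minimal pure-Python sketch. This is not the paper's implementation: the dot-product similarity, the softmax over words, the rule that a grid cell is positive when its center falls inside the box, the binary cross-entropy loss, and all function names below are assumptions chosen for illustration.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def region_text_features(region_feats, word_feats):
    """For each region, attend over all query words (assumed dot-product scores).

    Returns (weights, text), where weights[r][w] is the attention of region r
    on word w, and text[r] is the resulting region-dependent text vector.
    """
    dim = len(word_feats[0])
    all_weights, all_text = [], []
    for region in region_feats:
        scores = [sum(a * b for a, b in zip(region, word)) for word in word_feats]
        weights = softmax(scores)
        text = [sum(wt * word[d] for wt, word in zip(weights, word_feats))
                for d in range(dim)]
        all_weights.append(weights)
        all_text.append(text)
    return all_weights, all_text

def rectangular_mask(grid_h, grid_w, box, img_h, img_w):
    """Binary grid mask from a pixel box (x1, y1, x2, y2).

    Assumption: a cell is 1.0 when its center lies inside the box.
    """
    x1, y1, x2, y2 = box
    mask = []
    for i in range(grid_h):
        cy = (i + 0.5) * img_h / grid_h
        row = []
        for j in range(grid_w):
            cx = (j + 0.5) * img_w / grid_w
            row.append(1.0 if x1 <= cx <= x2 and y1 <= cy <= y2 else 0.0)
        mask.append(row)
    return mask

def mask_bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy between a predicted mask and the target mask."""
    total, count = 0.0, 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
            count += 1
    return total / count
```

In this sketch, a region whose feature vector aligns with a particular word puts more attention weight on that word, so each region receives its own text vector instead of one shared query embedding. The auxiliary loss would then compare a predicted attention map against the output of rectangular_mask for the annotated box.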

updated: Fri Dec 11 2020 10:00:22 GMT+0000 (UTC)

published: Fri Sep 11 2020 23:18:55 GMT+0000 (UTC)

arXiv
