VTNet: Visual Transformer Network for Object Goal Navigation

Heming Du; Xin Yu; Liang Zheng

Object goal navigation aims to steer an agent towards a target object based on the agent's observations. Designing effective visual representations of the observed scene is pivotal for determining navigation actions. In this paper, we introduce a Visual Transformer Network (VTNet) for learning informative visual representations in navigation. VTNet is a highly effective structure that embodies two key properties for visual representations: first, the relationships among all the object instances in a scene are exploited; second, the spatial locations of objects and image regions are emphasized so that directional navigation signals can be learned. Furthermore, we develop a pre-training scheme that associates the visual representations with navigation signals and thus facilitates navigation policy learning. In a nutshell, VTNet embeds object and region features together with their location cues as spatial-aware descriptors, and then incorporates all the encoded descriptors through attention operations to obtain an informative representation for navigation. Given such visual representations, agents are able to explore the correlations between visual observations and navigation actions. For example, an agent would prioritize "turning right" over "turning left" when the visual representation emphasizes the right side of the activation map. Experiments in the artificial environment AI2-Thor demonstrate that VTNet significantly outperforms state-of-the-art methods in unseen testing environments.
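The idea of spatial-aware descriptors fused by attention can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`spatial_descriptors`, `attend`), the dimensions, the 7x7 region grid, the random projection matrices standing in for learned weights, and the choice of region descriptors as queries over object descriptors are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def spatial_descriptors(features, boxes, w_feat, w_pos):
    # Hypothetical helper: embed appearance features together with
    # location cues. boxes are (x_center, y_center, width, height),
    # normalized to [0, 1].
    return features @ w_feat + boxes @ w_pos

def attend(queries, keys, values):
    # Standard scaled dot-product attention.
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ values, weights

rng = np.random.default_rng(0)
n_objects, grid_size, d_in, d_model = 5, 7, 16, 32
n_regions = grid_size * grid_size

# Random stand-ins for detector outputs and region features (assumed shapes).
obj_feats = rng.normal(size=(n_objects, d_in))
obj_boxes = rng.uniform(size=(n_objects, 4))
region_feats = rng.normal(size=(n_regions, d_in))

# Region location cues: cell centers of a regular grid over the image.
centers = (np.arange(grid_size) + 0.5) / grid_size
cx, cy = np.meshgrid(centers, centers)
cell = np.full(n_regions, 1.0 / grid_size)
region_boxes = np.stack([cx.ravel(), cy.ravel(), cell, cell], axis=1)

# Random projections standing in for learned embedding weights.
w_feat = rng.normal(size=(d_in, d_model)) / np.sqrt(d_in)
w_pos = rng.normal(size=(4, d_model)) / 2.0

obj_desc = spatial_descriptors(obj_feats, obj_boxes, w_feat, w_pos)
region_desc = spatial_descriptors(region_feats, region_boxes, w_feat, w_pos)

# Assumption of this sketch: each region attends over all object descriptors.
fused, weights = attend(region_desc, obj_desc, obj_desc)
```

Here `fused` has one row per image region and `weights` holds, for each region, a distribution over the detected objects. In a trained model the projections would be learned and the fused representation would feed a navigation policy; that part is omitted.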

updated: Thu May 20 2021 01:23:15 GMT+0000 (UTC)

published: Thu May 20 2021 01:23:15 GMT+0000 (UTC)

arXiv
