Neighbor-view Enhanced Model for Vision and Language Navigation

Dong An; Yuankai Qi; Yan Huang; Qi Wu; Liang Wang; Tieniu Tan

Vision and Language Navigation (VLN) requires an agent to navigate to a target location by following natural language instructions. Most existing works represent a navigation candidate by the feature of the single view in which the candidate lies. However, an instruction may mention landmarks outside that single view as references, which can cause the textual-visual matching of existing methods to fail. In this work, we propose a multi-module Neighbor-View Enhanced Model (NvEM) that adaptively incorporates visual contexts from neighbor views for better textual-visual matching. Specifically, NvEM uses a subject module and a reference module to collect contexts from neighbor views. The subject module fuses neighbor views at a global level, and the reference module fuses neighbor objects at a local level. Subjects and references are adaptively determined via attention mechanisms. Our model also includes an action module that exploits the strong orientation guidance (e.g., "turn left") in instructions. Each module predicts a navigation action separately, and their weighted sum is used to predict the final action. Extensive experimental results on the R2R and R4R benchmarks demonstrate the effectiveness of the proposed method against several state-of-the-art navigators, and NvEM even outperforms some navigators that rely on pre-training. Our code is available at https://github.com/MarSaKi/NvEM.
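
The final fusion step described in the abstract (each module scores the navigation candidates separately, and a weighted sum gives the final prediction) can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the function names, the dictionary layout, and the use of a softmax to normalize the module weights are assumptions, since the abstract does not say how the weights are produced.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_module_logits(module_logits, module_scores):
    """Combine per-module candidate logits into one list of fused logits.

    module_logits: dict mapping a module name (hypothetical: "subject",
        "reference", "action") to a list of logits, one per candidate.
    module_scores: dict mapping the same names to raw weight scores.
        Normalizing them with a softmax is an assumption of this sketch.
    """
    names = sorted(module_logits)
    weights = softmax([module_scores[name] for name in names])
    num_candidates = len(module_logits[names[0]])
    fused = [0.0] * num_candidates
    for name, weight in zip(names, weights):
        logits = module_logits[name]
        if len(logits) != num_candidates:
            raise ValueError("all modules must score the same candidates")
        for i, logit in enumerate(logits):
            fused[i] += weight * logit
    return fused


def choose_action(fused_logits):
    """Return the index of the candidate with the highest fused logit."""
    return max(range(len(fused_logits)), key=fused_logits.__getitem__)


# Made-up numbers: three candidates, equal raw scores for all modules.
example_logits = {
    "subject": [1.0, 2.0, 0.0],
    "reference": [0.0, 3.0, 0.0],
    "action": [2.0, 1.0, 0.0],
}
example_scores = {"subject": 0.0, "reference": 0.0, "action": 0.0}
fused = fuse_module_logits(example_logits, example_scores)
best = choose_action(fused)
```

With equal raw scores each module receives weight 1/3, so the fused logits are simply the per-candidate averages and the second candidate (index 1) is selected. In a trained model the weights would presumably depend on the instruction, which this sketch does not attempt to model.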

updated: Mon Jul 19 2021 11:10:21 GMT+0000 (UTC)

published: Thu Jul 15 2021 09:11:02 GMT+0000 (UTC)

arXiv
