MLANet: Multi-Level Attention Network with Sub-instruction for Continuous Vision-and-Language Navigation

Zongtao He; Liuyi Wang; Shu Li; Qingqing Yan; Chengju Liu; Qijun Chen

Vision-and-Language Navigation (VLN) aims to develop intelligent agents that navigate unseen environments using only language and vision supervision. In the recently proposed continuous setting (continuous VLN), the agent must act in a free 3D space and faces tougher challenges such as real-time execution, complex instruction understanding, and long action sequence prediction. To improve performance in continuous VLN, we design a multi-level instruction understanding procedure and propose a novel model, the Multi-Level Attention Network (MLANet). The first step of MLANet is to generate sub-instructions efficiently. We design a Fast Sub-instruction Algorithm (FSA) to segment the raw instruction into sub-instructions and generate a new sub-instruction dataset named "FSASub". FSA is annotation-free and 70 times faster than the current method, and thus fits the real-time requirement of continuous VLN. To solve the complex instruction understanding problem, MLANet needs a global perception of the instruction and observations. We propose a Multi-Level Attention (MLA) module to fuse vision, low-level semantics, and high-level semantics, which produces features containing a dynamic and global comprehension of the task. MLA also mitigates the adverse effects of noise words, thus ensuring a robust understanding of the instruction. To correctly predict actions in long trajectories, MLANet needs to focus on which sub-instruction is being executed at each step. We propose a Peak Attention Loss (PAL) to improve the flexible and adaptive selection of the current sub-instruction. PAL benefits the navigation agent by concentrating its attention on local information, thus helping the agent predict the most appropriate actions. We train and test MLANet on the standard benchmark. Experimental results show that MLANet outperforms baselines by a significant margin.
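
The abstract does not describe how FSA actually segments an instruction, so the following is only a minimal illustrative sketch of annotation-free, rule-based sub-instruction splitting, not the authors' algorithm. The function name `split_instruction` and the choice of split markers (punctuation and the connectives "and" / "then") are assumptions made for this example.

```python
import re
from typing import List

# Hypothetical split markers: punctuation and simple connectives.
# These rules are an assumption for illustration; the abstract does not
# specify which rules FSA uses.
_SPLIT_PATTERN = re.compile(
    r"\s*(?:[.,;]|\band then\b|\bthen\b|\band\b)\s*",
    re.IGNORECASE,
)


def split_instruction(instruction: str) -> List[str]:
    """Split a raw navigation instruction into candidate sub-instructions.

    Purely rule-based: no annotations or learned model are needed,
    which is what keeps this kind of segmentation cheap at run time.
    """
    pieces = _SPLIT_PATTERN.split(instruction)
    # Drop empty fragments left behind by adjacent markers (e.g. ". Then").
    return [piece.strip() for piece in pieces if piece.strip()]


if __name__ == "__main__":
    text = "Walk past the sofa and turn left. Then go up the stairs, and stop at the door."
    for index, sub in enumerate(split_instruction(text)):
        print(index, sub)
```

A purely rule-based splitter like this needs no annotated data, which illustrates why such segmentation can be fast enough for a real-time setting; a practical algorithm would likely need richer rules to avoid splitting noun phrases such as "the table and chairs".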

updated: Thu Mar 02 2023 16:26:14 GMT+0000 (UTC)

published: Thu Mar 02 2023 16:26:14 GMT+0000 (UTC)

arXiv
