Multimodal Attention Networks for Low-Level Vision-and-Language Navigation

Federico Landi; Lorenzo Baraldi; Marcella Cornia; Massimiliano Corsini; Rita Cucchiara

Vision-and-Language Navigation (VLN) is a challenging task in which an agent must follow a language-specified path to reach a target destination. The task becomes harder as the actions available to the agent become simpler and move towards low-level, atomic interactions with the environment. This setting is called low-level VLN. In this paper, we aim to build an agent that tackles three key issues: multi-modality, long-term dependencies, and adaptability to different locomotion settings. To that end, we devise "Perceive, Transform, and Act" (PTA): a fully-attentive VLN architecture that leaves the recurrent approach behind, and the first Transformer-like architecture to incorporate three different modalities: natural language, images, and low-level actions for agent control. In particular, we adopt an early fusion strategy to merge linguistic and visual information efficiently in our encoder. We then propose to refine the decoding phase with a late fusion extension between the agent's history of actions and the perceptual modalities. We experimentally validate our model on two datasets: PTA achieves promising results in low-level VLN on R2R and performs well on the recently proposed R4R benchmark. Our code is publicly available at https://github.com/aimagelab/perceive-transform-and-act.
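
The early-fusion and late-fusion ideas named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation (see the repository linked above for that); it is a minimal, dependency-free illustration under stated assumptions: all function names, the vector width `DIM`, the residual sums used to merge signals, and the choice of six output actions are hypothetical and chosen only to make the example run.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention over lists of equal-width vectors."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, values))
                    for i in range(len(values[0]))])
    return out


def early_fusion(words, views):
    """Encoder side (assumed form): each view attends to the instruction
    words, and the attended text is added to the view feature."""
    attended = attend(views, words, words)
    return [[v + a for v, a in zip(view, att)]
            for view, att in zip(views, attended)]


def late_fusion(action_history, fused_percepts, n_actions):
    """Decoder side (assumed form): the latest action embedding attends to
    the action history and to the fused percepts; the two results are summed
    and the first n_actions entries are treated as action logits."""
    query = [action_history[-1]]
    over_history = attend(query, action_history, action_history)[0]
    over_percepts = attend(query, fused_percepts, fused_percepts)[0]
    merged = [h + p for h, p in zip(over_history, over_percepts)]
    return softmax(merged[:n_actions])


random.seed(0)
DIM = 8  # hypothetical embedding width


def rand_vectors(n):
    """Random stand-ins for learned embeddings."""
    return [[random.uniform(-1.0, 1.0) for _ in range(DIM)] for _ in range(n)]


words = rand_vectors(5)    # instruction tokens
views = rand_vectors(3)    # image features
history = rand_vectors(4)  # embeddings of past low-level actions

fused = early_fusion(words, views)
probs = late_fusion(history, fused, n_actions=6)  # six actions is an assumption
print(probs)
```

In this sketch, language and vision are merged before any decision is made (early fusion), while the action history is combined with the perceptual signal only at the final decoding step (late fusion). A real model would use learned projections, multiple heads, and stacked layers rather than raw sums.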

updated: Fri Jul 30 2021 09:13:11 GMT+0000 (UTC)

published: Wed Nov 27 2019 19:00:24 GMT+0000 (UTC)

arXiv
