Graph based Environment Representation for Vision-and-Language Navigation in Continuous Environments

Ting Wang; Zongkai Wu; Feiyu Yao; Donglin Wang

Vision-and-Language Navigation in Continuous Environments (VLN-CE) is a navigation task that requires an agent to follow a language instruction in a realistic environment. Understanding the environment is a crucial part of the VLN-CE task, but existing methods treat the environment in a relatively simple and direct way, without delving into the relationship between language instructions and visual environments. To address this problem, we propose a new environment representation. First, we propose an Environment Representation Graph (ERG), built through object detection, to express the environment at the semantic level. This strengthens the relationship between language and environment. Then, the object-object and object-agent relational representations in the ERG are learned through a GCN, yielding a continuous representation of the ERG. Subsequently, we combine the ERG representation with object label embeddings to obtain the environment representation. Finally, we propose a new cross-modal attention navigation framework that incorporates our environment representation and a special loss function dedicated to training the ERG. Experimental results show that our method achieves satisfactory performance in terms of success rate on VLN-CE tasks. Further analysis shows that our method attains better cross-modal matching and strong generalization ability.
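
The abstract does not give implementation details, so the following is only a minimal sketch of the general idea: a graph whose nodes are the agent and detected objects, one standard graph-convolution layer over it, and a combination of the result with label embeddings. The node labels, the edge list, the embedding sizes, the random weights, and the use of concatenation as the combination step are all illustrative assumptions, not the authors' design.

```python
import numpy as np


def normalize_adjacency(adj):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2} with self-loops."""
    a_hat = adj + np.eye(adj.shape[0])
    deg = a_hat.sum(axis=1)          # >= 1 for every node because of self-loops
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def gcn_layer(a_norm, h, w):
    """One generic graph-convolution layer: ReLU(A_norm @ H @ W)."""
    return np.maximum(a_norm @ h @ w, 0.0)


rng = np.random.default_rng(0)

# Hypothetical nodes: the agent plus three detected objects.
labels = ["agent", "chair", "table", "door"]
n = len(labels)

# Hypothetical undirected edges: object-agent and object-object relations.
adj = np.zeros((n, n))
for i, j in [(0, 1), (0, 2), (0, 3), (1, 2)]:
    adj[i, j] = adj[j, i] = 1.0

# Assumed sizes: 8-dim label embeddings, 16-dim graph features.
label_emb = rng.normal(size=(n, 8))
w = rng.normal(size=(8, 16))

a_norm = normalize_adjacency(adj)
node_repr = gcn_layer(a_norm, label_emb, w)

# Assumption: combine graph features and label embeddings by concatenation.
env_repr = np.concatenate([node_repr, label_emb], axis=1)
print(env_repr.shape)
```

In a trained model the weight matrix and label embeddings would be learned rather than sampled at random, and the edges would come from the detector output rather than a hand-written list.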

updated: Wed Jan 11 2023 08:04:18 GMT+0000 (UTC)

published: Wed Jan 11 2023 08:04:18 GMT+0000 (UTC)

arXiv
