Local Slot Attention for Vision-and-Language Navigation

Yifeng Zhuang; Qiang Sun; Yanwei Fu; Lifeng Chen; Xiangyang Xue

Vision-and-language navigation (VLN), a frontier research area that aims to pave the way for general-purpose robots, has become a hot topic in the computer vision and natural language processing communities. The VLN task requires an agent to navigate to a goal location in an unfamiliar environment by following natural language instructions. Recently, transformer-based models have achieved significant improvements on the VLN task, because the attention mechanism in the transformer architecture can better integrate inter- and intra-modal information from vision and language. However, current transformer-based models have two problems: 1) the models process each view independently, without taking the integrity of objects into account; 2) during the self-attention operation in the visual modality, spatially distant views can be interwoven with each other without any explicit restriction. This kind of mixing may introduce extra noise instead of useful information. To address these issues, we propose 1) a slot-attention-based module that incorporates information from the segmentation of the same object, and 2) a local attention mask mechanism that limits the visual attention span. The proposed modules can be easily plugged into any VLN architecture; we use Recurrent VLN-Bert as our base model. Experiments on the R2R dataset show that our model achieves state-of-the-art results.
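The idea of limiting the visual attention span can be illustrated with a small sketch. The abstract does not specify how the mask is built, so everything below is an assumption made for illustration: the panorama is taken to be 36 views laid out as 3 elevations by 12 headings (index = elevation * 12 + heading), a view may attend only to views within `radius` steps in heading (wrapping around the panorama) and elevation, and the attention is a single head without learned projections. The names `local_view_mask` and `masked_self_attention` are hypothetical and are not taken from the paper.

```python
import numpy as np

def local_view_mask(n_headings=12, n_elevations=3, radius=1):
    """Boolean (V, V) mask; True where view i may attend to view j.

    Assumes views are indexed as elevation * n_headings + heading.
    """
    idx = np.arange(n_headings * n_elevations)
    elev, head = np.divmod(idx, n_headings)
    # Heading wraps around the panorama, so use circular distance.
    dh = np.abs(head[:, None] - head[None, :])
    dh = np.minimum(dh, n_headings - dh)
    # Elevation does not wrap.
    de = np.abs(elev[:, None] - elev[None, :])
    return (dh <= radius) & (de <= radius)

def masked_self_attention(x, mask):
    """Scaled dot-product self-attention over rows of x, restricted by mask."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    # Disallowed pairs get -inf so they receive exactly zero weight.
    scores = np.where(mask, scores, -np.inf)
    # Subtract the row maximum for numerical stability; each row has at
    # least its own diagonal entry allowed, so the maximum is finite.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x, weights

rng = np.random.default_rng(0)
views = rng.normal(size=(36, 8))   # toy features: 36 views, 8 dimensions
mask = local_view_mask()
out, weights = masked_self_attention(views, mask)
```

With `radius=1`, each view mixes only with its immediate neighbours in the panorama, so spatially distant views contribute nothing to its updated feature. In a transformer implementation the same boolean matrix would typically be passed as an attention mask to the visual self-attention layers.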

updated: Fri Jun 17 2022 09:21:26 GMT+0000 (UTC)

published: Fri Jun 17 2022 09:21:26 GMT+0000 (UTC)

arXiv
