MaAST: Map Attention with Semantic Transformers for Efficient Visual Navigation

Zachary Seymour; Kowshik Thopalli; Niluthpol Mithun; Han-Pang Chiu; Supun Samarasekera; Rakesh Kumar

Visual navigation for autonomous agents is a core task in the fields of computer vision and robotics. Learning-based methods, such as deep reinforcement learning, have the potential to outperform the classical solutions developed for this task; however, they come at the cost of a significantly increased computational load. In this work, we design a novel approach that aims to perform better than, or comparably to, existing learning-based solutions under a clear time/computational budget. To this end, we propose a method to encode vital scene semantics (such as traversable paths, unexplored areas, and observed scene objects), alongside raw visual streams such as RGB, depth, and semantic segmentation masks, into a semantically informed, top-down egocentric map representation. Further, to enable the effective use of this information, we introduce a novel 2-D map attention mechanism based on the successful multi-layer Transformer networks. We conduct experiments on 3-D reconstructed indoor PointGoal visual navigation and demonstrate the effectiveness of our approach. We show that by using our novel attention schema and auxiliary rewards to better utilize scene semantics, we outperform multiple baselines trained with only raw inputs or implicit semantic information while operating with an 80% decrease in the agent's experience.
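
The abstract does not specify how the 2-D map attention is computed, so the following is only a minimal sketch of the general idea of attending over the cells of a top-down map, not the authors' implementation. It assumes a single head and a single layer of plain scaled dot-product attention, uses each cell's feature vector as both key and value (no learned projections), and invents the three map channels, the function name `map_attention`, and the example grid purely for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a flat list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def map_attention(query, grid):
    """Scaled dot-product attention of one query vector over an H x W map.

    grid[r][c] is the feature vector of map cell (r, c). Each cell vector
    serves as both key and value (a simplifying assumption).
    Returns (H x W attention weights, attended context vector).
    """
    dim = len(query)
    width = len(grid[0])
    # Flatten the 2-D map into a sequence of cells, as a Transformer would.
    cells = [cell for row in grid for cell in row]
    scores = [
        sum(q * k for q, k in zip(query, cell)) / math.sqrt(dim)
        for cell in cells
    ]
    flat_weights = softmax(scores)
    # Weighted sum of cell features gives the context vector.
    context = [
        sum(w * cell[i] for w, cell in zip(flat_weights, cells))
        for i in range(dim)
    ]
    # Reshape the weights back onto the map layout.
    weights_2d = [
        flat_weights[r * width:(r + 1) * width] for r in range(len(grid))
    ]
    return weights_2d, context


# Hypothetical 2 x 2 egocentric map with three illustrative channels:
# [traversable, unexplored, object].
grid = [
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]],
]
# A query that favors traversable cells.
query = [2.0, 0.0, 0.0]
weights, context = map_attention(query, grid)
```

In this toy example the two traversable cells receive equal weight, and more weight than the unexplored or object cells. A full model along the lines the abstract describes would presumably add learned query, key, and value projections, multiple heads and layers, and some form of positional encoding for the map cells; those details are not given here.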

updated: Sun Mar 21 2021 12:01:23 GMT+0000 (UTC)

published: Sun Mar 21 2021 12:01:23 GMT+0000 (UTC)

arXiv
