AVLEN: Audio-Visual-Language Embodied Navigation in 3D Environments

Sudipta Paul; Amit K. Roy-Chowdhury; Anoop Cherian

Embodied visual navigation has recently advanced in two distinct directions: (i) equipping the AI agent to follow natural language instructions, and (ii) making the navigable world multimodal, e.g., audio-visual navigation. However, the real world is not only multimodal but often complex, so despite these advances, agents still need to understand the uncertainty in their actions and seek instructions to navigate. To this end, we present AVLEN, an interactive agent for Audio-Visual-Language Embodied Navigation. As in audio-visual navigation tasks, the goal of our embodied agent is to localize an audio event by navigating the 3D visual world; however, the agent may also seek help from a human (oracle), who provides assistance in free-form natural language. To realize these abilities, AVLEN uses a multimodal hierarchical reinforcement learning backbone that learns: (a) high-level policies that choose either to use audio cues for navigation or to query the oracle, and (b) lower-level policies that select navigation actions based on the agent's audio-visual and language inputs. The policies are trained by rewarding success on the navigation task while minimizing the number of queries to the oracle. To empirically evaluate AVLEN, we present experiments on the SoundSpaces framework for semantic audio-visual navigation tasks. Our results show that equipping the agent to ask for help leads to a clear improvement in performance, especially in challenging cases, e.g., when the sound is unheard during training or in the presence of distractor sounds.
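
The two-level decision structure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the option names, the `Observation` fields, the reward constants, and the plain callables standing in for learned policies are all assumptions made for illustration only.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical option labels: the abstract only states that the high-level
# policy chooses between audio-cue navigation and querying the oracle.
AUDIO_GOAL = "audio_goal"
QUERY_ORACLE = "query_oracle"


@dataclass(frozen=True)
class Observation:
    """Illustrative container for the agent's inputs (field names assumed)."""
    audio: Sequence[float]
    vision: Sequence[float]
    instruction: Optional[str] = None  # filled in only after an oracle query


Policy = Callable[[Observation], str]


def hierarchical_step(
    obs: Observation,
    high_level: Policy,
    audio_policy: Policy,
    language_policy: Policy,
    oracle: Callable[[Observation], str],
) -> Tuple[str, str]:
    """Run one decision step and return (chosen option, navigation action)."""
    option = high_level(obs)
    if option == QUERY_ORACLE:
        # Ask for help, then act on the free-form language instruction.
        obs = replace(obs, instruction=oracle(obs))
        return option, language_policy(obs)
    # Otherwise navigate from the audio-visual cues alone.
    return option, audio_policy(obs)


def shaped_reward(
    success: bool,
    num_queries: int,
    success_bonus: float = 10.0,   # assumed value, not from the paper
    query_penalty: float = 1.0,    # assumed value, not from the paper
) -> float:
    """Reward task success while charging a cost for each oracle query."""
    return (success_bonus if success else 0.0) - query_penalty * num_queries
```

In this sketch the trade-off between success and query count is a simple linear penalty; the abstract does not specify the actual form of the reward, so any other penalty schedule would fit the description equally well.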

updated: Fri Oct 14 2022 16:35:06 GMT+0000 (UTC)

published: Fri Oct 14 2022 16:35:06 GMT+0000 (UTC)

arXiv
