Less is More: Generating Grounded Navigation Instructions from Landmarks

Su Wang; Ceslee Montgomery; Jordi Orbay; Vighnesh Birodkar; Aleksandra Faust; Izzeddin Gur; Natasha Jaques; Austin Waters; Jason Baldridge; Peter Anderson

We study the automatic generation of navigation instructions from 360-degree images captured on indoor routes. Existing generators suffer from poor visual grounding, causing them to rely on language priors and hallucinate objects. Our MARKY-MT5 system addresses this by focusing on visual landmarks; it comprises a first-stage landmark detector and a second-stage generator (a multimodal, multilingual, multitask encoder-decoder). To train it, we bootstrap grounded landmark annotations on top of the Room-across-Room (RxR) dataset. Using text parsers, weak supervision from RxR's pose traces, and a multilingual image-text encoder trained on 1.8B images, we identify 971k English, Hindi and Telugu landmark descriptions and ground them to specific regions in panoramas. On Room-to-Room, human wayfinders obtain a success rate (SR) of 71% following MARKY-MT5's instructions, just shy of their 75% SR following human instructions and well above SRs with other generators. Evaluations on RxR's longer, diverse paths obtain SRs of 61-64% across the three languages. Generating such high-quality navigation instructions in novel environments is a step towards conversational navigation tools and could facilitate larger-scale training of instruction-following agents.

updated: Mon Apr 04 2022 21:21:27 GMT+0000 (UTC)

published: Thu Nov 25 2021 02:20:12 GMT+0000 (UTC)

arXiv
