The Road to Know-Where: An Object-and-Room Informed Sequential BERT for Indoor Vision-Language Navigation

Yuankai Qi; Zizheng Pan; Yicong Hong; Ming-Hsuan Yang; Anton van den Hengel; Qi Wu

Vision-and-Language Navigation (VLN) requires an agent to find a path to a remote location on the basis of natural-language instructions and a set of photo-realistic panoramas. Most existing methods take the words in the instructions and the discrete views of each panorama as the minimal units of encoding. However, this requires a model to match different nouns (e.g., TV, table) against the same input view feature. In this work, we propose an object-informed sequential BERT to encode visual perceptions and linguistic instructions at the same fine-grained level, namely objects and words. Our sequential BERT also enables the visual-textual clues to be interpreted in light of the temporal context, which is crucial to multi-round VLN tasks. Additionally, we enable the model to identify the relative direction (e.g., left/right/front/back) of each navigable location and the room type (e.g., bedroom, kitchen) of its current and final navigation goal, since instructions frequently mention such information to indicate the desired next and final locations. We thus enable the model to know-where the objects lie in the images, and to know-where they stand in the scene. Extensive experiments demonstrate the effectiveness of the proposed method against several state-of-the-art methods on three indoor VLN tasks: REVERIE, NDH, and R2R. Project repository: https://github.com/YuankaiQi/ORIST
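
The abstract mentions identifying the relative direction (left/right/front/back) of each navigable location. As a rough illustration only, the sketch below bins the heading offset between the agent and a candidate location into those four labels. The function name `relative_direction`, the 90-degree bins, and the clockwise-positive heading convention are assumptions made for this example; they are not taken from the paper or from the ORIST repository, where the direction is something the model learns to identify.

```python
import math


def relative_direction(agent_heading: float, target_heading: float) -> str:
    """Map a heading offset (radians) to front/right/back/left.

    Assumption: headings increase clockwise, so a positive offset
    means the target lies to the agent's right.
    """
    # Wrap the offset into the interval [-pi, pi).
    offset = (target_heading - agent_heading + math.pi) % (2 * math.pi) - math.pi

    quarter = math.pi / 4
    if -quarter <= offset < quarter:
        return "front"
    if quarter <= offset < 3 * quarter:
        return "right"
    if -3 * quarter <= offset < -quarter:
        return "left"
    return "back"


if __name__ == "__main__":
    # Candidate locations at several headings, with the agent facing heading 0.
    for name, heading in [("door", 0.1), ("sofa", math.pi / 2), ("stairs", math.pi)]:
        print(name, relative_direction(0.0, heading))
```

A hand-written rule like this could at most serve as a source of direction labels or as a sanity check; how the paper actually supervises or uses the direction signal is not specified in the abstract.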

updated: Wed Aug 25 2021 08:54:25 GMT+0000 (UTC)

published: Fri Apr 09 2021 02:44:39 GMT+0000 (UTC)

arXiv
