Look Wide and Interpret Twice: Improving Performance on Interactive Instruction-following Tasks

Van-Quang Nguyen; Masanori Suganuma; Takayuki Okatani

There is growing interest in the community in making an embodied AI agent perform complicated tasks while interacting with an environment by following natural language directives. Recent studies have tackled the problem using ALFRED, a well-designed dataset for the task, but achieved only very low accuracy. This paper proposes a new method that outperforms previous methods by a large margin. It is based on a combination of several new ideas. One is a two-stage interpretation of the provided instructions. The method first selects and interprets an instruction without using visual information, yielding a tentative prediction of an action sequence. It then integrates this prediction with visual and other information, yielding the final prediction of an action and an object. Because the class of the object to interact with is identified in the first stage, the method can accurately select the correct object from the input image. Moreover, our method considers multiple egocentric views of the environment and extracts essential information by applying hierarchical attention conditioned on the current instruction. This contributes to the accurate prediction of navigation actions. A preliminary version of the method won the ALFRED Challenge 2020. The current version achieves a success rate of 4.45% in unseen environments with a single view, which improves to 8.37% with multiple views.
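
To make the idea of hierarchical attention over multiple egocentric views more concrete, here is a minimal toy sketch. It is not the authors' implementation: the abstract does not specify the architecture, so plain dot-product attention stands in for whatever learned modules the paper uses, and the names `attend` and `hierarchical_attention`, the feature vectors, and the two-level (regions, then views) structure are all illustrative assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attend(query, items):
    """Dot-product attention (an assumed stand-in for a learned module).

    Returns the weighted average of `items` and the attention weights.
    """
    weights = softmax([dot(query, item) for item in items])
    dim = len(items[0])
    pooled = [sum(w * item[d] for w, item in zip(weights, items))
              for d in range(dim)]
    return pooled, weights


def hierarchical_attention(instruction, views):
    """Two-level attention conditioned on an instruction vector.

    `views` is a list of egocentric views; each view is a list of
    region feature vectors (hypothetical inputs for illustration).
    Level 1 pools the regions inside each view; level 2 pools the
    resulting per-view summaries.
    """
    view_summaries = [attend(instruction, regions)[0] for regions in views]
    return attend(instruction, view_summaries)


# Hypothetical toy inputs: a 2-D instruction embedding and two views.
instruction = [1.0, 0.0]
views = [
    [[2.0, 0.0], [0.0, 1.0]],  # view 0 contains a region matching the instruction
    [[0.0, 1.0], [0.0, 2.0]],  # view 1 contains no matching region
]
pooled, view_weights = hierarchical_attention(instruction, views)
print(view_weights)
```

In this toy setup the view containing a region aligned with the instruction receives the larger weight, which mirrors the abstract's claim that instruction-conditioned attention extracts the essential information from several views.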

updated: Sun Jun 06 2021 14:38:04 GMT+0000 (UTC)

published: Tue Jun 01 2021 16:06:09 GMT+0000 (UTC)

arXiv
