Talk-to-Resolve: Combining scene understanding and spatial dialogue to resolve granular task ambiguity for a collocated robot

Pradip Pramanick; Chayan Sarkar; Snehasis Banerjee; Brojeshwar Bhowmick

The utility of a collocated robot largely depends on an easy and intuitive mechanism for interacting with humans. If a robot accepts task instructions in natural language, it must first understand the user's intention by decoding the instruction. However, while executing the task, the robot may face unforeseeable circumstances due to variations in the observed scene and therefore require further user intervention. In this article, we present a system called Talk-to-Resolve (TTR) that enables a robot to resolve such an impasse by visually observing the scene and initiating a coherent dialogue exchange with the instructor. Through dialogue, it finds either a cue to move forward with the original plan, an acceptable alternative to the original plan, or an affirmation to abort the task altogether. To detect a possible stalemate, we use the dense captions of the observed scene jointly with the given instruction to compute the robot's next action. We evaluate our system on a data set of pairs of initial instructions and situational scenes. Our system can identify stalemates and resolve them through appropriate dialogue exchange with 82% accuracy. Additionally, a user study reveals that the questions from our system are more natural (4.02 on average on a scale of 1 to 5) than those from a state-of-the-art system (3.08 on average).
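The abstract does not specify how the next action is computed, so the following is only a minimal illustrative sketch of the idea of combining scene captions with an instruction to choose among the three dialogue outcomes it names (a cue to proceed, an alternative, or an abort). Every name here (`Caption`, `NextAction`, `decide_next_action`, `make_question`, the `similar` lookup) is hypothetical, and the simple label matching stands in for whatever the authors actually use.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional, Set, Tuple


class NextAction(Enum):
    PROCEED = "proceed"                  # exactly one match: continue the plan
    ASK_WHICH = "ask_which"              # several matches: ask which one is meant
    ASK_ALTERNATIVE = "ask_alternative"  # no match, but a similar object is visible
    ASK_ABORT = "ask_abort"              # nothing usable: confirm aborting the task


@dataclass
class Caption:
    obj: str          # object label (assumed output of a captioning step)
    description: str  # caption text for that object


def decide_next_action(
    target: str,
    captions: List[Caption],
    similar: Dict[str, Set[str]],
) -> Tuple[NextAction, List[Caption]]:
    """Pick the next action from the instructed target and the scene captions."""
    matches = [c for c in captions if c.obj == target]
    if len(matches) == 1:
        return NextAction.PROCEED, matches
    if len(matches) > 1:
        return NextAction.ASK_WHICH, matches
    # No exact match: look for objects listed as acceptable substitutes.
    alternatives = [c for c in captions if c.obj in similar.get(target, set())]
    if alternatives:
        return NextAction.ASK_ALTERNATIVE, alternatives
    return NextAction.ASK_ABORT, []


def make_question(
    action: NextAction, target: str, candidates: List[Caption]
) -> Optional[str]:
    """Turn the chosen action into a question for the instructor, if one is needed."""
    if action is NextAction.ASK_WHICH:
        options = " or ".join(c.description for c in candidates)
        return f"I see more than one {target}: {options}. Which one do you mean?"
    if action is NextAction.ASK_ALTERNATIVE:
        return (f"I cannot see a {target}, but I see "
                f"{candidates[0].description}. Should I use that instead?")
    if action is NextAction.ASK_ABORT:
        return f"I cannot see a {target}. Should I abort the task?"
    return None  # PROCEED needs no question
```

For example, with two captions labeled `cup` in view and the target `cup`, this sketch returns `ASK_WHICH` and builds a question that quotes both caption texts, so the question refers to what the robot actually observes.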

updated: Mon Nov 22 2021 10:42:59 GMT+0000 (UTC)

published: Mon Nov 22 2021 10:42:59 GMT+0000 (UTC)

arXiv
