Sim-To-Real Transfer of Visual Grounding for Human-Aided Ambiguity Resolution

Georgios Tziafas; Hamidreza Kasaei

Service robots should be able to interact naturally with non-expert human users, not only to help them in various tasks but also to receive guidance in order to resolve ambiguities that might be present in the instruction. We consider the task of visual grounding, where the agent segments an object from a crowded scene given a natural language description. Modern holistic approaches to visual grounding usually ignore language structure and struggle to cover generic domains, therefore relying heavily on large datasets. Additionally, their transfer performance in RGB-D datasets suffers due to high visual discrepancy between the benchmark and the target domains. Modular approaches marry learning with domain modeling and exploit the compositional nature of language to decouple visual representation from language parsing, but either rely on external parsers or are trained in an end-to-end fashion due to the lack of strong supervision. In this work, we seek to tackle these limitations by introducing a fully decoupled modular framework for compositional visual grounding of entities, attributes, and spatial relations. We exploit rich scene graph annotations generated in a synthetic domain and train each module independently. Our approach is evaluated both in simulation and in two real RGB-D scene datasets. Experimental results show that the decoupled nature of our framework allows for easy integration with domain adaptation approaches for Sim-To-Real visual recognition, offering a data-efficient, robust, and interpretable solution to visual grounding in robotic applications.
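To make the idea of decoupled, compositional grounding more concrete, the following is a minimal illustrative sketch and not the authors' implementation. It assumes a hand-written toy scene and replaces the learned entity, attribute, and spatial-relation modules with simple symbolic lookups. The query is assumed to be already parsed into a nested dictionary. All names (`SceneObject`, `ground_entity`, `filter_attribute`, `relate`, `ground`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneObject:
    obj_id: int
    category: str
    color: str
    x: float  # horizontal position; a smaller value means further left


def ground_entity(scene, category):
    """Entity module (toy stand-in): keep objects of the named category."""
    return [o for o in scene if o.category == category]


def filter_attribute(candidates, color):
    """Attribute module (toy stand-in): keep candidates with the given color."""
    return [o for o in candidates if o.color == color]


def relate(targets, anchors, relation):
    """Spatial-relation module (toy stand-in): keep targets that stand in
    the relation to at least one anchor object."""
    if relation == "left_of":
        return [t for t in targets
                if any(t.x < a.x for a in anchors if a.obj_id != t.obj_id)]
    if relation == "right_of":
        return [t for t in targets
                if any(t.x > a.x for a in anchors if a.obj_id != t.obj_id)]
    raise ValueError(f"unsupported relation: {relation}")


def ground(scene, query):
    """Compose the three modules according to an already-parsed query."""
    candidates = ground_entity(scene, query["category"])
    if "color" in query:
        candidates = filter_attribute(candidates, query["color"])
    if "relation" in query:
        relation, anchor_query = query["relation"]
        anchors = ground(scene, anchor_query)  # anchors are grounded recursively
        candidates = relate(candidates, anchors, relation)
    return candidates


# Hypothetical scene: two red mugs, one blue mug, one white bowl.
scene = [
    SceneObject(0, "mug", "red", 0.1),
    SceneObject(1, "mug", "blue", 0.6),
    SceneObject(2, "bowl", "white", 0.4),
    SceneObject(3, "mug", "red", 0.9),
]

# "the red mug" matches two objects, so the instruction is ambiguous.
ambiguous = ground(scene, {"category": "mug", "color": "red"})

# "the red mug left of the bowl" adds the guidance needed to pick one object.
resolved = ground(scene, {
    "category": "mug",
    "color": "red",
    "relation": ("left_of", {"category": "bowl"}),
})
```

In this sketch, a result with more than one candidate is the signal that the instruction is ambiguous and that the agent could ask the user for an additional attribute or spatial relation. Because each module is a separate function, any one of them could in principle be swapped for a separately trained model without touching the others, which is the property the abstract attributes to the decoupled design.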

updated: Tue May 24 2022 14:12:32 GMT+0000 (UTC)

published: Tue May 24 2022 14:12:32 GMT+0000 (UTC)

arXiv
