Differentiable Parsing and Visual Grounding of Verbal Instructions for Object Placement

Zirui Zhao; Wee Sun Lee; David Hsu

Grounding spatial relations expressed in natural language for object placement raises problems of ambiguity and compositionality. To address them, we introduce ParaGon, a PARsing And visual GrOuNding framework for language-conditioned object placement. ParaGon parses a language instruction into relations between objects and grounds those objects in the visual scene. A particle-based graph neural network (GNN) then performs relational reasoning over the grounded objects to generate placements. ParaGon encodes all of these steps as neural networks and trains them end-to-end, which avoids the need to annotate parsing and object-reference grounding labels. Our approach integrates parsing-based methods into a probabilistic, data-driven framework. It is data-efficient and generalizable when learning compositional instructions, robust to noisy language inputs, and able to adapt to the uncertainty of ambiguous instructions.
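
The data flow described in the abstract (instruction parsed into object relations, objects softly grounded in the scene, weighted placement hypotheses generated) can be illustrated with a toy sketch. This is not the ParaGon implementation: the `Relation` structure, the dot-product grounding score, the fixed `OFFSETS` table, and all embeddings and coordinates are assumptions made purely for illustration. In the paper these steps are learned neural modules and the relational reasoning is done by a particle-based GNN.

```python
import math
from dataclasses import dataclass


@dataclass
class Relation:
    """Hypothetical parsed form of an instruction such as
    'put the mug to the left of the plate'."""
    subject: str     # object to be placed
    predicate: str   # spatial relation
    reference: str   # object the relation is anchored to


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def ground(phrase_vec, object_vecs):
    """Soft grounding: a probability distribution over scene objects,
    scored here by a plain dot product (an assumption)."""
    scores = [sum(p * o for p, o in zip(phrase_vec, vec)) for vec in object_vecs]
    return softmax(scores)


# Hypothetical fixed offsets in metres; a learned model would predict these.
OFFSETS = {"left_of": (-0.1, 0.0), "right_of": (0.1, 0.0)}


def place_particles(relation, probs, positions):
    """One weighted placement hypothesis (a 'particle') per candidate
    reference object, so grounding ambiguity is kept rather than discarded."""
    dx, dy = OFFSETS[relation.predicate]
    return [((x + dx, y + dy), w) for (x, y), w in zip(positions, probs)]


rel = Relation(subject="mug", predicate="left_of", reference="plate")
object_vecs = [(1.0, 0.0), (0.0, 1.0)]   # toy embeddings: plate, bowl
positions = [(0.5, 0.5), (0.2, 0.8)]     # toy table coordinates
probs = ground((2.0, 0.0), object_vecs)  # toy embedding of the phrase "plate"
particles = place_particles(rel, probs, positions)
best_xy, best_w = max(particles, key=lambda p: p[1])
```

Keeping every candidate as a weighted particle, instead of committing to the single best match, is one simple way to carry the uncertainty of an ambiguous instruction through to the placement step.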

updated: Tue Oct 18 2022 12:47:47 GMT+0000 (UTC)

published: Sat Oct 01 2022 07:36:51 GMT+0000 (UTC)

arXiv
