OCID-Ref: A 3D Robotic Dataset with Embodied Language for Clutter Scene Grounding

Ke-Jyun Wang; Yun-Hsuan Liu; Hung-Ting Su; Jen-Wei Wang; Yu-Siang Wang; Winston H. Hsu; Wen-Chin Chen

To effectively apply robots in working environments and assist humans, it is essential to study and evaluate how visual grounding (VG) affects machine performance on occluded objects. However, current VG work is limited in working environments such as offices and warehouses, where objects are usually occluded due to space utilization issues. In our work, we propose a novel OCID-Ref dataset featuring a referring expression segmentation task with referring expressions of occluded objects. OCID-Ref consists of 305,694 referring expressions from 2,300 scenes and provides both RGB image and point cloud inputs. We argue that it is crucial to take advantage of both 2D and 3D signals to resolve challenging occlusion issues. Our experimental results demonstrate the effectiveness of aggregating 2D and 3D signals, but referring to occluded objects remains challenging for modern visual grounding systems. OCID-Ref is publicly available at https://github.com/lluma/OCID-Ref

updated: Wed Apr 14 2021 09:03:46 GMT+0000 (UTC)

published: Sat Mar 13 2021 10:38:15 GMT+0000 (UTC)

arXiv
