Comprehensive Visual Question Answering on Point Clouds through Compositional Scene Manipulation

Xu Yan; Zhihao Yuan; Yuhao Du; Yinghong Liao; Yao Guo; Zhen Li; Shuguang Cui

Visual Question Answering on 3D Point Clouds (VQA-3D) is an emerging yet challenging field that aims to answer various types of textual questions about an entire point cloud scene. To tackle this problem, we propose CLEVR3D, a large-scale VQA-3D dataset consisting of 171K questions from 8,771 3D scenes. Specifically, we develop a question engine that leverages 3D scene graph structures to generate diverse reasoning questions covering objects' attributes (i.e., size, color, and material) and their spatial relationships. In this way, we initially generated 44K questions from 1,333 real-world scenes. Moreover, we propose a more challenging setup that removes confounding bias and adjusts the context away from common-sense layouts. This setup requires the network to achieve comprehensive visual understanding when the 3D scene departs from the usual co-occurrence context (e.g., chairs always appearing with tables). To this end, we further introduce a compositional scene manipulation strategy and generate 127K questions from 7,438 augmented 3D scenes, which can improve VQA-3D models for real-world comprehension. Building on the proposed dataset, we baseline several VQA-3D models, and experimental results verify that CLEVR3D can significantly boost other 3D scene understanding tasks. Our code and dataset will be made publicly available at https://github.com/yanx27/CLEVR3D.
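To make the idea of a scene-graph-driven question engine more concrete, the following is a minimal sketch in Python. It is not the CLEVR3D implementation: the class names (`SceneObject`, `SceneGraph`), the question templates, and the `remove_object` manipulation are all hypothetical and chosen only for illustration. The sketch assumes a scene graph is a list of labeled objects with attribute dictionaries plus (subject, predicate, object) relation triples, and it only emits questions whose referent is unambiguous.

```python
from dataclasses import dataclass, field


@dataclass
class SceneObject:
    obj_id: int
    label: str
    attributes: dict  # hypothetical layout, e.g. {"color": "brown"}


@dataclass
class SceneGraph:
    objects: list
    # Each relation is a (subject_id, predicate, object_id) triple.
    relations: list = field(default_factory=list)


def label_counts(graph):
    """Count how many objects share each label."""
    counts = {}
    for obj in graph.objects:
        counts[obj.label] = counts.get(obj.label, 0) + 1
    return counts


def attribute_questions(graph, attribute):
    """Ask about an attribute of every object whose label is unique."""
    counts = label_counts(graph)
    return [
        (f"What is the {attribute} of the {obj.label}?", obj.attributes[attribute])
        for obj in graph.objects
        if counts[obj.label] == 1 and attribute in obj.attributes
    ]


def relation_questions(graph):
    """Ask which object stands in a relation to a uniquely labeled anchor."""
    counts = label_counts(graph)
    by_id = {obj.obj_id: obj for obj in graph.objects}
    questions = []
    for subj_id, predicate, anchor_id in graph.relations:
        anchor = by_id[anchor_id]
        same = [s for s, p, a in graph.relations if p == predicate and a == anchor_id]
        # Keep the question only if the anchor and the answer are unambiguous.
        if counts[anchor.label] == 1 and len(same) == 1:
            questions.append((f"What is {predicate} the {anchor.label}?", by_id[subj_id].label))
    return questions


def remove_object(graph, obj_id):
    """Toy scene manipulation: drop one object and every relation touching it."""
    objects = [o for o in graph.objects if o.obj_id != obj_id]
    relations = [r for r in graph.relations if obj_id not in (r[0], r[2])]
    return SceneGraph(objects=objects, relations=relations)
```

Under these assumptions, removing a table from a scene with `remove_object` and regenerating questions yields a scene in which chairs appear without a table, which is the kind of broken co-occurrence context the abstract describes.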

updated: Mon May 22 2023 02:55:52 GMT+0000 (UTC)

published: Wed Dec 22 2021 06:43:21 GMT+0000 (UTC)

arXiv
