MORE: Multi-Order RElation Mining for Dense Captioning in 3D Scenes

Yang Jiao; Shaoxiang Chen; Zequn Jie; Jingjing Chen; Lin Ma; Yu-Gang Jiang

3D dense captioning is a recently proposed task in which point clouds provide more geometric information than their 2D counterparts. It is also more challenging, because inter-object relations are more complex and more varied. Existing methods treat such relations only as by-products of object feature learning in graphs, without encoding them explicitly, which leads to sub-optimal results. In this paper, we aim to improve 3D dense captioning by capturing and utilizing the complex relations in a 3D scene, and propose MORE, a Multi-Order RElation mining model, to support generating more descriptive and comprehensive captions. Technically, MORE encodes object relations progressively, since complex relations can be deduced from a limited number of basic ones. We first devise a novel Spatial Layout Graph Convolution (SLGC), which semantically encodes several first-order relations as edges of a graph constructed over 3D object proposals. Next, from the resulting graph, we extract multiple triplets, each encapsulating basic first-order relations as the basic unit, and construct several Object-centric Triplet Attention Graphs (OTAG) to infer multi-order relations for every target object. The updated node features from OTAG are aggregated and fed into the caption decoder, providing abundant relational cues so that captions containing diverse relations with context objects can be generated. Extensive experiments on the Scan2Cap dataset demonstrate the effectiveness of MORE and its components, and MORE outperforms the current state-of-the-art method.
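The idea of encoding first-order relations as graph edges and then collecting triplets around each target object can be illustrated with a toy sketch. This is not the paper's SLGC or OTAG: the relation labels, the dominant-axis rule, and the function names `first_order_relation`, `build_relation_graph`, and `triplets_for_target` are all assumptions made for illustration, and the sketch has no learned features or attention.

```python
from itertools import permutations

# Hypothetical relation labels per axis: (negative offset, positive offset).
# The abstract does not list the first-order relations; these are assumed.
AXIS_LABELS = {"x": ("left", "right"), "y": ("behind", "front"), "z": ("below", "above")}


def first_order_relation(center_a, center_b):
    """Label where object b lies relative to object a.

    Illustrative rule only: use the axis with the largest absolute offset.
    """
    offsets = dict(zip("xyz", (b - a for a, b in zip(center_a, center_b))))
    axis = max(offsets, key=lambda k: abs(offsets[k]))
    negative, positive = AXIS_LABELS[axis]
    return positive if offsets[axis] >= 0 else negative


def build_relation_graph(centers):
    """Return directed edges {(a, b): label} over all ordered object pairs."""
    return {
        (a, b): first_order_relation(centers[a], centers[b])
        for a, b in permutations(centers, 2)
    }


def triplets_for_target(graph, target):
    """Collect (target, relation, context) triplets centered on one object."""
    return [(a, rel, b) for (a, b), rel in graph.items() if a == target]


# Toy scene: object name -> 3D box center (made-up coordinates).
centers = {
    "chair": (0.0, 0.0, 0.0),
    "table": (1.0, 0.2, 0.0),
    "lamp": (0.1, 0.0, 1.5),
}
graph = build_relation_graph(centers)
chair_triplets = triplets_for_target(graph, "chair")
print(chair_triplets)
```

In a learned model, each triplet would carry feature vectors rather than string labels, and the triplets around a target object would be weighted (for example by attention) before being passed to a caption decoder; the sketch only shows the data flow from pairwise relations to object-centric triplets.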

updated: Thu Mar 10 2022 07:26:15 GMT+0000 (UTC)

published: Thu Mar 10 2022 07:26:15 GMT+0000 (UTC)

arXiv
