VL-SAT: Visual-Linguistic Semantics Assisted Training for 3D Semantic Scene Graph Prediction in Point Cloud

Ziqin Wang; Bowen Cheng; Lichen Zhao; Dong Xu; Yang Tang; Lu Sheng

The task of 3D semantic scene graph (3DSSG) prediction in point clouds is challenging because (1) a 3D point cloud captures only geometric structures, with limited semantics compared to 2D images, and (2) the long-tailed distribution of relations inherently hinders the learning of unbiased predictions. Since 2D images provide rich semantics and scene graphs are by nature coupled with language, in this study we propose the Visual-Linguistic Semantics Assisted Training (VL-SAT) scheme, which can significantly strengthen the ability of 3DSSG prediction models to discriminate long-tailed and ambiguous semantic relations. The key idea is to train a powerful multi-modal oracle model to assist the 3D model. This oracle learns reliable structural representations based on semantics from vision, language, and 3D geometry, and its benefits can be heterogeneously passed to the 3D model during the training stage. By effectively utilizing visual-linguistic semantics in training, VL-SAT can significantly boost common 3DSSG prediction models, such as SGFN and SGGpoint, while requiring only 3D inputs at the inference stage; the gains are especially pronounced for tail relation triplets. Comprehensive evaluations and ablation studies on the 3DSSG dataset validate the effectiveness of the proposed scheme. Code is available at https://github.com/wz7in/CVPR2023-VLSAT.
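The abstract does not say how the oracle's benefits are passed to the 3D model, so the following is only an illustrative sketch of the general pattern it describes: an auxiliary signal from a multi-modal oracle is used during training, while inference relies on the 3D branch alone. The feature-alignment (mean squared error) term, the weight `align_weight`, and all function names are assumptions made for illustration; they are not taken from the paper or its released code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Relation classification loss for one object pair."""
    return -math.log(softmax(logits)[label])


def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def training_loss(logits_3d, feat_3d, oracle_feat, label, align_weight=0.5):
    """Training-time objective (hypothetical form).

    Combines the 3D model's classification loss with an assumed term
    that pulls its features toward those of the multi-modal oracle.
    """
    return cross_entropy(logits_3d, label) + align_weight * mse(feat_3d, oracle_feat)


def predict_relation(logits_3d):
    """Inference uses the 3D branch only; no oracle inputs are needed."""
    return max(range(len(logits_3d)), key=lambda i: logits_3d[i])
```

In this sketch the oracle features appear only in `training_loss`; `predict_relation` takes nothing but the 3D model's logits, mirroring the claim that only 3D inputs are required at inference.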

updated: Sat Mar 25 2023 09:14:18 GMT+0000 (UTC)

published: Sat Mar 25 2023 09:14:18 GMT+0000 (UTC)

arXiv
