VoxDet: Voxel Learning for Novel Instance Detection

Bowen Li; Jiashun Wang; Yaoyu Hu; Chen Wang; Sebastian Scherer

VoxDet: 新規インスタンス検出のためのボクセル学習

マルチビューテンプレートに基づいて目に見えないインスタンスを検出することは、オープンワールドの性質により困難な問題です。主に 2D 表現とマッチング技術に依存する従来の方法論では、ポーズのバリエーションやオクルージョンの処理が不十分なことがよくあります。これを解決するために、強力な 3D ボクセル表現と信頼性の高いボクセルマッチングメカニズムを完全に活用する先駆的な 3D ジオメトリ対応フレームワークである VoxDet を導入します。 VoxDet はまず、テンプレートボクセルアグリゲーション (TVA) モジュールを独創的に提案し、マルチビュー 2D 画像を 3D ボクセル特徴に効果的に変換します。関連するカメラのポーズを活用することで、これらの特徴がコンパクトな 3D テンプレートボクセルに集約されます。新しいインスタンスの検出では、このボクセル表現は、オクルージョンやポーズの変化に対する回復力の向上を示します。また、3D 再構成目標が TVA での 2D-3D マッピングの事前トレーニングに役立つことも発見しました。 2 番目に、テンプレートボクセルと迅速に位置合わせするために、VoxDet にはクエリボクセルマッチング (QVM) モジュールが組み込まれています。 2D クエリは、まず学習された 2D-3D マッピングを使用してボクセル表現に変換されます。 3D ボクセル表現はジオメトリをエンコードしているため、最初に相対回転を推定し、次に位置合わせされたボクセルを比較することができ、精度と効率の向上につながることがわかりました。要求の厳しい LineMod-Occlusion、YCB-video、および新しく構築された RoboTools ベンチマークで徹底的な実験が行われ、VoxDet は 20% 高い再現率と高速な速度でさまざまな 2D ベースラインを著しく上回りました。私たちの知る限り、VoxDet は 2D タスクに暗黙的な 3D 知識を組み込んだ最初の製品です。

Detecting unseen instances based on multi-view templates is a challenging problem due to its open-world nature. Traditional methodologies, which primarily rely on 2D representations and matching techniques, are often inadequate in handling pose variations and occlusions. To solve this, we introduce VoxDet, a pioneer 3D geometry-aware framework that fully utilizes the strong 3D voxel representation and reliable voxel matching mechanism. VoxDet first ingeniously proposes template voxel aggregation (TVA) module, effectively transforming multi-view 2D images into 3D voxel features. By leveraging associated camera poses, these features are aggregated into a compact 3D template voxel. In novel instance detection, this voxel representation demonstrates heightened resilience to occlusion and pose variations. We also discover that a 3D reconstruction objective helps to pre-train the 2D-3D mapping in TVA. Second, to quickly align with the template voxel, VoxDet incorporates a Query Voxel Matching (QVM) module. The 2D queries are first converted into their voxel representation with the learned 2D-3D mapping. We find that since the 3D voxel representations encode the geometry, we can first estimate the relative rotation and then compare the aligned voxels, leading to improved accuracy and efficiency. Exhaustive experiments are conducted on the demanding LineMod-Occlusion, YCB-video, and the newly built RoboTools benchmarks, where VoxDet outperforms various 2D baselines remarkably with 20% higher recall and faster speed. To the best of our knowledge, VoxDet is the first to incorporate implicit 3D knowledge for 2D tasks.

updated: Fri May 26 2023 19:25:13 GMT+0000 (UTC)

published: Fri May 26 2023 19:25:13 GMT+0000 (UTC)

arXiv

参考文献 (このサイトで利用可能なもの) / References (only if available on this site)

被参照文献 (このサイトで利用可能なものを新しい順に) / Citations (only if available on this site, in order of most recent)

Amazon.co.jpアソシエイト