COPE: End-to-end trainable Constant Runtime Object Pose Estimation

Stefan Thalhammer; Timothy Patten; Markus Vincze

State-of-the-art object pose estimation handles multiple instances in a test image by using multi-model formulations: detection as a first stage, followed by separately trained networks per object for 2D-3D geometric correspondence prediction as a second stage. Poses are subsequently estimated using the Perspective-n-Point (PnP) algorithm at runtime. Unfortunately, multi-model formulations are slow and do not scale well with the number of object instances involved. Recent approaches show that direct 6D object pose estimation is feasible when derived from the aforementioned geometric correspondences. We present an approach that learns an intermediate geometric representation of multiple objects to directly regress the 6D poses of all instances in a test image. The inherent end-to-end trainability removes the need to process individual object instances separately. By calculating the mutual Intersection-over-Unions, pose hypotheses are clustered into distinct instances, which incurs negligible runtime overhead with respect to the number of object instances. Results on multiple challenging standard datasets show that the pose estimation performance is superior to single-model state-of-the-art approaches while being more than 35 times faster. We additionally provide an analysis showing real-time applicability (>24 fps) for images in which more than 90 object instances are present. Further results show the advantage of supervising geometric-correspondence-based object pose estimation with the 6D pose.
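The clustering step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each pose hypothesis comes with an axis-aligned 2D box `(x_min, y_min, x_max, y_max)`, that hypotheses are linked when their IoU exceeds a hypothetical `threshold` of 0.5, and that links are merged transitively with union-find. The names `iou` and `cluster_by_mutual_iou` are invented for this illustration.

```python
from typing import Dict, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed layout


def iou(a: Box, b: Box) -> float:
    """Intersection-over-Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def cluster_by_mutual_iou(boxes: Sequence[Box], threshold: float = 0.5) -> List[List[int]]:
    """Group hypothesis indices whose boxes overlap above `threshold`.

    Overlapping hypotheses are merged transitively (connected components).
    The threshold value is an assumption, not taken from the paper.
    """
    parent = list(range(len(boxes)))

    def find(i: int) -> int:
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Compare every pair of hypotheses once and link the overlapping ones.
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if iou(boxes[i], boxes[j]) > threshold:
                parent[find(j)] = find(i)

    # Collect indices by cluster root, in order of first appearance.
    clusters: Dict[int, List[int]] = {}
    for i in range(len(boxes)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())
```

For example, `cluster_by_mutual_iou([(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)])` groups the first two hypotheses into one instance and leaves the third on its own. In a full pipeline, the poses within each cluster would then be combined into a single estimate per instance; how that is done is outside the scope of this sketch.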

updated: Mon Aug 22 2022 12:06:50 GMT+0000 (UTC)

published: Thu Aug 18 2022 12:58:53 GMT+0000 (UTC)

arXiv
