Depth-based 6DoF Object Pose Estimation using Swin Transformer

Zhujun Li; Ioannis Stamos

Accurately estimating the 6D pose of objects is crucial for many applications, such as robotic grasping, autonomous driving, and augmented reality. However, this task becomes more challenging in poor lighting conditions or when dealing with textureless objects. To address this issue, depth images are becoming an increasingly popular choice because they are invariant to a scene's appearance and implicitly encode essential geometric characteristics. Nevertheless, fully leveraging depth information to improve pose estimation remains a difficult and under-investigated problem. To tackle this challenge, we propose a novel framework called SwinDePose that uses only geometric information from depth images to achieve accurate 6D pose estimation. SwinDePose first calculates the angles between each normal vector defined in a depth image and the three coordinate axes of the camera coordinate system. The resulting angles are then formed into an image, which is encoded using Swin Transformer. Additionally, we apply RandLA-Net to learn representations from point clouds. The resulting image and point cloud embeddings are concatenated and fed into a semantic segmentation module and a 3D keypoint localization module. Finally, we estimate 6D poses using a least-squares fitting approach based on the target object's predicted semantic mask and 3D keypoints. In experiments on the LineMod and Occlusion LineMod datasets, SwinDePose outperforms existing state-of-the-art methods for 6D object pose estimation using depth images. This demonstrates the effectiveness of our approach and highlights its potential for improving performance in real-world scenarios. Our code is at https://github.com/zhujunli1993/SwinDePose.
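The two purely geometric steps in the abstract, the normal-angle image and the final least-squares pose fit, can be illustrated with a minimal NumPy sketch. It is not taken from the SwinDePose repository and rests on several assumptions: a pinhole camera with hypothetical intrinsics `fx`, `fy`, `cx`, `cy`; normals estimated from the cross product of finite differences of the back-projected points (the abstract does not say how normals are computed); and the standard SVD-based (Kabsch) solution standing in for the least-squares fitting step. The function names `normal_angle_image` and `fit_rigid_transform` are illustrative only.

```python
import numpy as np


def normal_angle_image(depth, fx, fy, cx, cy):
    """Return an (H, W, 3) image of angles in radians between each
    surface normal and the camera X, Y, and Z axes.

    Assumes a pinhole camera; fx, fy, cx, cy are hypothetical intrinsics.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    # Back-project every pixel to a 3D point in the camera frame.
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    points = np.stack([x, y, depth], axis=-1)
    # Finite-difference tangent vectors along image columns and rows.
    du = np.gradient(points, axis=1)
    dv = np.gradient(points, axis=0)
    normals = np.cross(du, dv)
    length = np.linalg.norm(normals, axis=-1, keepdims=True)
    normals = normals / np.clip(length, 1e-12, None)
    # The components of a unit normal are the cosines of its angles
    # to the three coordinate axes.
    return np.arccos(np.clip(normals, -1.0, 1.0))


def fit_rigid_transform(src, dst):
    """Least-squares rotation R and translation t with dst ~ src @ R.T + t.

    src and dst are (N, 3) arrays of corresponding 3D keypoints.
    Uses the SVD-based (Kabsch) solution, an assumed stand-in for the
    fitting step named in the abstract.
    """
    src_c = src.mean(axis=0)
    dst_c = dst.mean(axis=0)
    # Cross-covariance of the centered point sets.
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t
```

In this sketch, `src` would hold the keypoints defined in the object's model frame and `dst` the keypoints predicted in the camera frame, so the returned `R` and `t` together form the 6D pose. The angle image has three channels, which is what allows it to be fed to an image encoder in the same way as an RGB image.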

updated: Fri Mar 03 2023 18:25:07 GMT+0000 (UTC)

published: Fri Mar 03 2023 18:25:07 GMT+0000 (UTC)

arXiv
