6D-ViT: Category-Level 6D Object Pose Estimation via Transformer-based Instance Representation Learning

Lu Zou; Zhangjin Huang

This paper presents 6D-ViT, a transformer-based instance representation learning network for highly accurate category-level object pose estimation from RGB-D images. Specifically, a novel two-stream encoder-decoder framework explores complex and powerful instance representations from RGB images, point clouds, and categorical shape priors. The framework consists of two main branches, named Pixelformer and Pointformer. The Pixelformer combines a pyramid transformer encoder with an all-MLP decoder to extract pixelwise appearance representations from RGB images, while the Pointformer relies on a cascaded transformer encoder and an all-MLP decoder to acquire pointwise geometric characteristics from point clouds. Dense instance representations (i.e., a correspondence matrix and a deformation field) are then obtained from a multi-source aggregation network that takes the shape priors, appearance, and geometric information as input. Finally, the instance 6D pose is computed by leveraging the correspondence among the dense representations, the shape priors, and the instance point clouds. Extensive experiments on both synthetic and real-world datasets demonstrate that the proposed 3D instance representation learning framework achieves state-of-the-art performance on both and significantly outperforms all existing methods.
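The abstract does not say how the final pose is solved from the dense representations. One common pattern in shape-prior pipelines is to deform the prior with the deformation field, map it to per-point canonical coordinates through the correspondence matrix, and then fit a similarity transform to the observed points with the Umeyama algorithm. The NumPy sketch below assumes that pattern; the function names, array shapes, and the use of Umeyama alignment are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def umeyama_similarity(src, dst):
    """Least-squares similarity transform (s, R, t) such that dst ~ s * R @ src + t.

    src, dst: arrays of shape (N, 3) holding corresponding points.
    """
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst

    # Cross-covariance between the centered point sets.
    cov = dst_c.T @ src_c / src.shape[0]
    U, S, Vt = np.linalg.svd(cov)

    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(U @ Vt))
    D = np.diag([1.0, 1.0, d])
    R = U @ D @ Vt

    var_src = (src_c ** 2).sum() / src.shape[0]
    s = np.trace(np.diag(S) @ D) / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t


def pose_from_dense_representation(prior, deformation, correspondence, observed):
    """Hypothetical pose step; all argument names and shapes are assumptions.

    prior:          (M, 3) categorical shape prior points
    deformation:    (M, 3) predicted per-point offsets applied to the prior
    correspondence: (N, M) soft assignment of each observed point to prior points
    observed:       (N, 3) instance point cloud in the camera frame
    """
    reconstructed = prior + deformation           # instance shape in canonical space
    canonical = correspondence @ reconstructed    # canonical coordinate per observed point
    return umeyama_similarity(canonical, observed)
```

Calling `pose_from_dense_representation` returns a scale, a 3x3 rotation, and a translation. A practical system would likely wrap the alignment in an outlier-rejection scheme such as RANSAC, which this sketch omits.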

updated: Sun Oct 10 2021 13:34:16 GMT+0000 (UTC)

published: Sun Oct 10 2021 13:34:16 GMT+0000 (UTC)

arXiv
