Self-supervised learning of object pose estimation using keypoint prediction

Zahra Gharaee; Felix Järemo Lawin; Per-Erik Forssén

This paper describes recent developments in object-specific pose and shape prediction from single images. The main contribution is a new approach to camera pose prediction by self-supervised learning of keypoints corresponding to locations on a category-specific deformable shape. We designed a network to generate a proxy ground-truth heatmap from a set of keypoints distributed all over the category-specific mean shape, where each keypoint is represented by a unique color on a labeled texture. The proxy ground-truth heatmap is used to train a deep keypoint prediction network, which can be used in online inference. The proposed approach to camera pose prediction shows significant improvements when compared with state-of-the-art methods. Our approach to camera pose prediction is used to infer 3D objects from 2D image frames of video sequences online. The reconstruction model is trained using only a silhouette mask from a single frame of a video sequence at each training step, together with a category-specific mean object shape. We conducted experiments using three different datasets representing the bird category: the CUB [51] image dataset and the YouTubeVos and Davis video datasets. The network is trained on the CUB dataset and tested on all three datasets. The online experiments are demonstrated on YouTubeVos and Davis [56] video sequences using a network trained on the CUB training set.
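
As a rough illustration of what a keypoint heatmap target can look like, the following minimal sketch renders one Gaussian heatmap per 2D keypoint and reads the keypoint locations back as per-channel maxima. It is not the authors' implementation: the abstract states that the proxy heatmaps come from a network using a color-labeled texture on the mean shape, whereas this sketch assumes the keypoints have already been projected to pixel coordinates. The function names, the Gaussian form, and the `sigma` parameter are all assumptions made for illustration.

```python
import numpy as np


def heatmaps_from_keypoints(points_xy, height, width, sigma=2.0):
    """Render one Gaussian heatmap per 2D keypoint.

    points_xy: sequence of (x, y) pixel coordinates (hypothetical input).
    Returns an array of shape (K, height, width) with peak value 1.0
    at each integer keypoint location.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    pts = np.asarray(points_xy, dtype=np.float64)
    # Broadcast to (K, H, W): squared distance from every pixel to every keypoint.
    d2 = (xs[None] - pts[:, 0, None, None]) ** 2 \
        + (ys[None] - pts[:, 1, None, None]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))


def keypoints_from_heatmaps(heatmaps):
    """Recover (x, y) for each heatmap channel as the location of its maximum."""
    k, h, w = heatmaps.shape
    flat_idx = heatmaps.reshape(k, -1).argmax(axis=1)
    ys, xs = np.unravel_index(flat_idx, (h, w))
    return np.stack([xs, ys], axis=1)


if __name__ == "__main__":
    pts = [(3, 5), (10, 2)]
    hm = heatmaps_from_keypoints(pts, height=16, width=16)
    print(hm.shape)
    print(keypoints_from_heatmaps(hm))
```

In a pipeline of the kind the abstract describes, 2D keypoints predicted this way would be matched to their known 3D positions on the mean shape to estimate the camera pose; that step is not shown here.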

updated: Sun Feb 19 2023 19:56:50 GMT+0000 (UTC)

published: Tue Feb 14 2023 21:47:25 GMT+0000 (UTC)

arXiv
