Adaptively Multi-view and Temporal Fusing Transformer for 3D Human Pose Estimation

Hui Shuai; Lele Wu; Qingshan Liu

In practical applications, 3D Human Pose Estimation (HPE) faces several variable factors, including the number of views, the length of the video sequence, and whether camera calibration is used. To this end, we propose a unified framework named Multi-view and Temporal Fusing Transformer (MTF-Transformer) that adaptively handles varying numbers of views and video lengths without calibration. MTF-Transformer consists of a Feature Extractor, a Multi-view Fusing Transformer (MFT), and a Temporal Fusing Transformer (TFT). The Feature Extractor estimates the 2D pose from each image and encodes the predicted coordinates and confidence into a feature embedding for subsequent 3D pose inference. It discards the image features and focuses on lifting the 2D pose to the 3D pose, which keeps the subsequent modules computationally lightweight enough to handle videos. MFT fuses the features of a varying number of views with a relative-attention block: it adaptively measures the implicit relationship between each pair of views and reconstructs the features. TFT aggregates the features of the whole sequence and predicts the 3D pose via a transformer, which adapts to the length of the video and takes full advantage of the temporal information. With these modules, MTF-Transformer handles application scenarios ranging from a single monocular image to multi-view video, without requiring camera calibration. We demonstrate quantitative and qualitative results on Human3.6M, TotalCapture, and KTH Multiview Football II. Compared with state-of-the-art methods that use camera parameters, experiments show that MTF-Transformer not only obtains comparable results but also generalizes well to dynamic capture with an arbitrary number of unseen views. Code is available at https://github.com/lelexx/MTF-Transformer.
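The data flow described in the abstract (per-view 2D pose encoding, fusion across views, aggregation across time) can be illustrated with a small sketch. This is not the authors' implementation: it substitutes plain dot-product self-attention for the relative-attention block and mean pooling for the temporal transformer, uses no learned weights, and stops at fused features rather than regressing 3D joints. All function names (`encode_pose`, `fuse_views`, `fuse_time`, `pipeline`, `make_clip`) are hypothetical. The only point it makes is that none of the steps depends on a fixed number of views or frames.

```python
import math
import random


def encode_pose(pose_2d, conf):
    """Flatten (x, y, confidence) per joint into one feature vector."""
    feat = []
    for (x, y), c in zip(pose_2d, conf):
        feat.extend([x, y, c])
    return feat


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_views(view_feats):
    """Each view attends to every view (stand-in for the MFT, assumption).

    Works for any number of views because the attention weights are
    computed over whatever list of views is passed in.
    """
    dim = len(view_feats[0])
    scale = math.sqrt(dim)
    fused = []
    for query in view_feats:
        scores = [sum(a * b for a, b in zip(query, key)) / scale
                  for key in view_feats]
        weights = softmax(scores)
        fused.append([sum(w * v[d] for w, v in zip(weights, view_feats))
                      for d in range(dim)])
    return fused


def fuse_time(frame_feats):
    """Mean over frames (stand-in for the TFT, assumption)."""
    num_frames = len(frame_feats)
    dim = len(frame_feats[0])
    return [sum(f[d] for f in frame_feats) / num_frames for d in range(dim)]


def pipeline(clip):
    """clip[t][v] = (pose_2d, conf); returns one fused feature per view."""
    per_frame = [fuse_views([encode_pose(p, c) for p, c in frame])
                 for frame in clip]
    num_views = len(clip[0])
    return [fuse_time([frame[v] for frame in per_frame])
            for v in range(num_views)]


def make_clip(num_frames, num_views, num_joints=17, seed=0):
    """Random 2D poses and confidences, purely for demonstration."""
    rng = random.Random(seed)
    clip = []
    for _ in range(num_frames):
        frame = []
        for _ in range(num_views):
            pose = [(rng.random(), rng.random()) for _ in range(num_joints)]
            conf = [rng.random() for _ in range(num_joints)]
            frame.append((pose, conf))
        clip.append(frame)
    return clip


# The same code path handles a 4-view, 5-frame clip and a single image.
multi_view_video = pipeline(make_clip(num_frames=5, num_views=4))
single_image = pipeline(make_clip(num_frames=1, num_views=1))
```

With one view and one frame, the attention weight is trivially 1 and the temporal mean is the identity, so the sketch degenerates to the encoded 2D pose, which mirrors the monocular single-image case mentioned in the abstract.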

updated: Mon Oct 11 2021 08:57:43 GMT+0000 (UTC)

published: Mon Oct 11 2021 08:57:43 GMT+0000 (UTC)

arXiv
