Adaptive Multi-view and Temporal Fusing Transformer for 3D Human Pose Estimation

Hui Shuai; Lele Wu; Qingshan Liu

This paper proposes a unified framework, the Multi-view and Temporal Fusing Transformer (MTF-Transformer), that adaptively handles varying numbers of views and varying video lengths without camera calibration in 3D Human Pose Estimation (HPE). It consists of a Feature Extractor, a Multi-view Fusing Transformer (MFT), and a Temporal Fusing Transformer (TFT). The Feature Extractor estimates a 2D pose from each image and fuses the predictions according to their confidence. It provides a pose-focused feature embedding and keeps the subsequent modules computationally lightweight. The MFT fuses the features of a varying number of views with a novel Relative-Attention block. It adaptively measures the implicit relative relationship between each pair of views and reconstructs more informative features. The TFT aggregates the features of the whole sequence and predicts the 3D pose via a transformer. It adaptively handles videos of arbitrary length and fully utilizes the temporal information. Adopting transformers enables the model to learn spatial geometry better and to remain robust across varying application scenarios. We report quantitative and qualitative results on Human3.6M, TotalCapture, and KTH Multiview Football II. Compared with state-of-the-art methods that use camera parameters, MTF-Transformer obtains competitive results and generalizes well to dynamic capture with an arbitrary number of unseen views.
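To illustrate why attention-based fusion can accept a varying number of views, here is a minimal NumPy sketch. It uses plain scaled dot-product attention across views, not the paper's Relative-Attention block, whose details the abstract does not give. The function name `fuse_views`, the projection matrices `w_q`, `w_k`, `w_v`, and the feature size are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def fuse_views(features, w_q, w_k, w_v):
    """Fuse per-view features with standard attention across views.

    Hypothetical stand-in for multi-view fusion; NOT the paper's
    Relative-Attention block.

    features: array of shape (V, D), one feature vector per view.
              V may differ from call to call.
    Returns (fused, weights): fused has shape (V, D_out), weights (V, V).
    """
    q = features @ w_q
    k = features @ w_k
    v = features @ w_v
    # Pairwise affinity between every pair of views, shape (V, V).
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)
    # Each output row is a weighted mix of the values of all views.
    return weights @ v, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim = 8  # assumed feature size, chosen only for this demo
    w_q, w_k, w_v = (rng.standard_normal((dim, dim)) for _ in range(3))
    # The same weights are reused for 2 views and for 4 views.
    for n_views in (2, 4):
        feats = rng.standard_normal((n_views, dim))
        fused, weights = fuse_views(feats, w_q, w_k, w_v)
        print(n_views, fused.shape, weights.shape)
```

Because the learned projections act on each view independently and the attention weights are computed per pair, nothing in this sketch depends on a fixed view count or view order. That property is what lets this kind of fusion run on camera setups with a different number of views than were used in training.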

updated: Mon Jul 04 2022 04:44:32 GMT+0000 (UTC)

published: Mon Oct 11 2021 08:57:43 GMT+0000 (UTC)

arXiv
