Lifting Transformer for 3D Human Pose Estimation in Video

Wenhao Li; Hong Liu; Runwei Ding; Mengyuan Liu; Pichao Wang

Despite great progress in video-based 3D human pose estimation, it is still challenging to learn a discriminative single-pose representation from redundant sequences. To this end, we propose a novel Transformer-based architecture, called Lifting Transformer, that lifts a sequence of 2D joint locations to a 3D pose. Specifically, a vanilla Transformer encoder (VTE) is adopted to model long-range dependencies of 2D pose sequences. To reduce sequence redundancy and aggregate information from local context, the fully-connected layers in the feed-forward network of the VTE are replaced with strided convolutions that progressively reduce the sequence length. The modified VTE is termed the strided Transformer encoder (STE) and is built upon the outputs of the VTE. The STE not only significantly reduces the computation cost but also effectively aggregates information into a single-vector representation in both a global and a local fashion. Moreover, a full-to-single supervision scheme is employed at both the full-sequence scale and the single-target-frame scale, applied to the outputs of the VTE and the STE, respectively. This scheme imposes extra temporal smoothness constraints in conjunction with the single-target-frame supervision. The proposed architecture is evaluated on two challenging benchmark datasets, namely Human3.6M and HumanEva-I, and achieves state-of-the-art results with far fewer parameters.
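To make the "progressively reduce the sequence length" step concrete, the following is a minimal sketch of how stacked strided convolutions can shrink a frame sequence to a single element. It is a toy illustration only, not the authors' implementation: the 27-frame input, the kernel size of 3, the stride schedule [3, 3, 3], the averaging kernel, and all function names are assumptions chosen for the example, and it operates on scalars rather than on the feature vectors inside a Transformer feed-forward block.

```python
from typing import List, Sequence, Tuple


def conv_output_length(length: int, kernel_size: int, stride: int, padding: int = 0) -> int:
    """Output length of a 1D convolution with dilation 1."""
    return (length + 2 * padding - kernel_size) // stride + 1


def strided_conv1d(seq: Sequence[float], kernel: Sequence[float], stride: int) -> List[float]:
    """Unpadded 1D strided convolution (cross-correlation) over a scalar sequence."""
    out_len = conv_output_length(len(seq), len(kernel), stride)
    return [
        sum(w * seq[i * stride + j] for j, w in enumerate(kernel))
        for i in range(out_len)
    ]


def reduce_sequence(
    seq: Sequence[float], kernel: Sequence[float], strides: Sequence[int]
) -> Tuple[List[float], List[int]]:
    """Apply one strided convolution per layer and record the length after each."""
    current = list(seq)
    lengths = [len(current)]
    for stride in strides:
        current = strided_conv1d(current, kernel, stride)
        lengths.append(len(current))
    return current, lengths


# Hypothetical setup: 27 frames, each summarized by one scalar feature.
frames = [float(t) for t in range(27)]
# Hypothetical kernel: a local average over 3 neighboring frames.
mean_kernel = [1 / 3, 1 / 3, 1 / 3]

final, lengths = reduce_sequence(frames, mean_kernel, strides=[3, 3, 3])
print(lengths)  # [27, 9, 3, 1]
```

Under these assumed settings, each layer merges three neighboring positions into one, so three layers collapse 27 frames into a single representation. This corresponds to the role the abstract assigns to the STE; the actual layer count and strides would depend on the input sequence length used.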

updated: Wed Apr 14 2021 01:23:08 GMT+0000 (UTC)

published: Fri Mar 26 2021 07:35:08 GMT+0000 (UTC)

arXiv
