Exploiting Temporal Contexts with Strided Transformer for 3D Human Pose Estimation

Wenhao Li; Hong Liu; Runwei Ding; Mengyuan Liu; Pichao Wang; Wenming Yang

Despite great progress in 3D human pose estimation from videos, it remains an open problem to take full advantage of redundant 2D pose sequences to learn a representative representation for generating a single 3D pose. To this end, we propose an improved Transformer-based architecture, called Strided Transformer, that lifts a sequence of 2D joint locations to a 3D pose for 3D human pose estimation in videos. Specifically, a vanilla Transformer encoder (VTE) is adopted to model long-range dependencies of 2D pose sequences. To reduce the redundancy of the sequence and aggregate information from local context, strided convolutions are incorporated into the VTE to progressively reduce the sequence length. The modified VTE is termed the strided Transformer encoder (STE) and is built upon the outputs of the VTE. The STE not only effectively aggregates long-range information into a single-vector representation in a hierarchical global and local fashion but also significantly reduces the computation cost. Furthermore, a full-to-single supervision scheme is designed at both the full sequence scale and the single target frame scale, applied to the outputs of the VTE and STE, respectively. This scheme imposes extra temporal smoothness constraints in conjunction with the single target frame supervision and improves the representation ability of features for the target frame. The proposed architecture is evaluated on two challenging benchmark datasets, Human3.6M and HumanEva-I, and achieves state-of-the-art results with far fewer parameters.
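The two mechanisms named in the abstract, progressive sequence shortening by strided convolutions and full-to-single supervision, can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the unpadded-convolution length formula, the example kernel sizes and strides, the choice of the center frame as the target frame, the mean per-joint Euclidean error, and the loss weights `w_full` and `w_single` are all assumptions made for illustration.

```python
import math


def strided_lengths(seq_len, kernel_sizes, strides):
    """Sequence length after each unpadded strided 1D convolution.

    Assumption: no padding and no dilation, so each layer maps a
    length L to floor((L - k) / s) + 1.
    """
    lengths = [seq_len]
    for k, s in zip(kernel_sizes, strides):
        if lengths[-1] < k:
            raise ValueError("kernel is larger than the remaining sequence")
        lengths.append((lengths[-1] - k) // s + 1)
    return lengths


def mean_joint_error(pred, target):
    """Mean Euclidean distance over joints; a pose is a list of (x, y, z)."""
    return sum(math.dist(p, t) for p, t in zip(pred, target)) / len(target)


def full_to_single_loss(seq_pred, single_pred, seq_target,
                        w_full=1.0, w_single=1.0):
    """Weighted sum of a full-sequence loss and a single-target-frame loss.

    seq_pred:    per-frame 3D poses (standing in for the VTE-side output).
    single_pred: one 3D pose (standing in for the STE-side output).
    Assumption: the target frame is the center frame of the sequence.
    """
    center = len(seq_target) // 2
    full = sum(mean_joint_error(p, t)
               for p, t in zip(seq_pred, seq_target)) / len(seq_target)
    single = mean_joint_error(single_pred, seq_target[center])
    return w_full * full + w_single * single
```

Under these assumptions, `strided_lengths(27, [3, 3, 3], [3, 3, 3])` steps a 27-frame sequence down through lengths 9 and 3 to a single vector, which is the kind of progressive reduction the abstract describes. The full-sequence term supervises every frame, while the single-frame term supervises only the pose predicted for the target frame.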

updated: Thu Jul 22 2021 10:15:11 GMT+0000 (UTC)

published: Fri Mar 26 2021 07:35:08 GMT+0000 (UTC)

arXiv

