Exploiting Temporal Contexts with Strided Transformer for 3D Human Pose Estimation

Wenhao Li; Hong Liu; Runwei Ding; Mengyuan Liu; Pichao Wang; Wenming Yang

Despite great progress in 3D human pose estimation from videos, it remains an open problem to take full advantage of a redundant 2D pose sequence to learn a representation well suited to generating a single 3D pose. To this end, we propose an improved Transformer-based architecture, called Strided Transformer, which simply and effectively lifts a long sequence of 2D joint locations to a single 3D pose. Specifically, a Vanilla Transformer Encoder (VTE) is adopted to model long-range dependencies of 2D pose sequences. To reduce the redundancy of the sequence, the fully-connected layers in the feed-forward network of VTE are replaced with strided convolutions, which progressively shrink the sequence length and aggregate information from local contexts. The modified VTE is termed the Strided Transformer Encoder (STE) and is built upon the outputs of VTE. STE not only effectively aggregates long-range information into a single-vector representation in a hierarchical global and local fashion, but also significantly reduces the computational cost. Furthermore, a full-to-single supervision scheme is designed at two scales: the full-sequence scale, applied to the outputs of VTE, and the single-target-frame scale, applied to the outputs of STE. This scheme imposes extra temporal smoothness constraints in conjunction with the single target frame supervision and hence helps produce smoother and more accurate 3D poses. The proposed Strided Transformer is evaluated on two challenging benchmark datasets, Human3.6M and HumanEva-I, and achieves state-of-the-art results with fewer parameters. Code and models are available at https://github.com/Vegetebird/StridedTransformer-Pose3D.
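The two ideas in the abstract that are easiest to show in code are the strided convolution that shortens the sequence and the full-to-single supervision that combines a full-sequence loss with a single-target-frame loss. The following is a minimal sketch in plain Python, not the authors' implementation (their code is in the repository linked above). Everything specific in it is an assumption made for illustration: the function names, the 27-frame input, the kernel size and stride of 3, the averaging kernel, the absence of padding, the use of a mean per-joint Euclidean distance as the error, and the loss weights `w_full` and `w_single`.

```python
import math
from typing import List, Sequence, Tuple

Frame = List[float]                   # one feature vector per video frame
Pose = List[Tuple[float, float, float]]  # one (x, y, z) triple per joint


def strided_conv1d(seq: Sequence[Frame], kernel: Sequence[float], stride: int) -> List[Frame]:
    """Temporal convolution without padding, same kernel for every channel.

    Each output frame is a weighted sum of len(kernel) neighbouring input
    frames. The window advances `stride` frames per step, so the output
    sequence is shorter than the input.
    """
    k = len(kernel)
    if len(seq) < k:
        raise ValueError("sequence is shorter than the kernel")
    out = []
    for start in range(0, len(seq) - k + 1, stride):
        window = seq[start:start + k]
        dim = len(window[0])
        out.append([sum(w * f[d] for w, f in zip(kernel, window)) for d in range(dim)])
    return out


def shrink_to_single(seq: Sequence[Frame], kernel: Sequence[float], stride: int):
    """Apply the strided convolution repeatedly until one frame remains."""
    lengths = [len(seq)]
    while len(seq) > 1:
        seq = strided_conv1d(seq, kernel, stride)
        lengths.append(len(seq))
    return seq[0], lengths


def mean_joint_error(pred: Pose, target: Pose) -> float:
    """Mean Euclidean distance between predicted and target joints."""
    return sum(math.dist(p, t) for p, t in zip(pred, target)) / len(pred)


def full_to_single_loss(full_pred: Sequence[Pose], full_target: Sequence[Pose],
                        single_pred: Pose, single_target: Pose,
                        w_full: float = 1.0, w_single: float = 1.0) -> float:
    """Weighted sum of a full-sequence term and a single-target-frame term."""
    full_term = sum(mean_joint_error(p, t)
                    for p, t in zip(full_pred, full_target)) / len(full_pred)
    single_term = mean_joint_error(single_pred, single_target)
    return w_full * full_term + w_single * single_term


# Usage: 27 one-dimensional frames, averaged three at a time.
frames = [[float(t)] for t in range(27)]
vector, lengths = shrink_to_single(frames, kernel=[1 / 3] * 3, stride=3)
print(lengths)  # sequence length after each strided convolution
```

In this sketch the full-sequence term plays the role of the supervision on the VTE output and the single-frame term plays the role of the supervision on the STE output. In the actual model the strided convolutions sit inside Transformer encoder layers with learned weights, which this toy averaging kernel does not attempt to reproduce.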

updated: Tue Jan 11 2022 02:41:57 GMT+0000 (UTC)

published: Fri Mar 26 2021 07:35:08 GMT+0000 (UTC)

arXiv
