TransFusion: A Practical and Effective Transformer-based Diffusion Model for 3D Human Motion Prediction

Sibo Tian; Minghui Zheng; Xiao Liang



Predicting human motion plays a crucial role in ensuring safe and effective close human-robot collaboration in the intelligent remanufacturing systems of the future. Existing works fall into two groups: those that focus on accuracy by predicting a single future motion, and those that generate diverse predictions from the observations. The former group fails to address the uncertainty and multi-modal nature of human motion, while the latter often produces motion sequences that deviate too far from the ground truth or become unrealistic given the historical context. To tackle these issues, we propose TransFusion, an innovative and practical diffusion-based model for 3D human motion prediction that can generate samples that are more likely to occur while maintaining a certain level of diversity. Our model uses a Transformer as the backbone, with long skip connections between shallow and deep layers. Additionally, we employ the discrete cosine transform (DCT) to model motion sequences in the frequency space, which improves performance. In contrast to prior diffusion-based models that use extra modules such as cross-attention and adaptive layer normalization to condition the prediction on the past observed motion, we treat all inputs, including the conditions, as tokens, which yields a more lightweight model than existing approaches. Extensive experiments are conducted on benchmark datasets to validate the effectiveness of our human motion prediction model.
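The abstract names two mechanisms without showing them: representing a motion sequence by its DCT coefficients, and feeding the condition to the Transformer as ordinary tokens instead of through cross-attention or adaptive layer normalization. The sketch below illustrates both ideas under stated assumptions; it is not the authors' implementation. It uses the standard orthonormal DCT-II along the time axis. The number of kept coefficients, the token order `[diffusion step | observed motion | noisy future]`, the sinusoidal step embedding, and the `project` callable (a stand-in for a learned linear embedding) are all hypothetical choices made for illustration, as are the function names.

```python
import math
from typing import Callable, List

Matrix = List[List[float]]


def dct_matrix(n: int) -> Matrix:
    """Orthonormal DCT-II basis of shape (n, n): rows are frequencies, columns are frames."""
    rows = []
    for k in range(n):
        # DC row uses sqrt(1/n), the others sqrt(2/n), so every row has unit norm
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * t + 1) * k / (2 * n)) for t in range(n)])
    return rows


def to_frequency(motion: Matrix, n_coeffs: int) -> Matrix:
    """Keep the first n_coeffs DCT coefficients of a (T, D) motion sequence -> (n_coeffs, D)."""
    n_frames, dim = len(motion), len(motion[0])
    basis = dct_matrix(n_frames)
    return [[sum(basis[k][t] * motion[t][j] for t in range(n_frames)) for j in range(dim)]
            for k in range(n_coeffs)]


def from_frequency(coeffs: Matrix, n_frames: int) -> Matrix:
    """Inverse transform: an orthonormal DCT-II is inverted by its transpose (DCT-III)."""
    basis = dct_matrix(n_frames)
    n_coeffs, dim = len(coeffs), len(coeffs[0])
    return [[sum(basis[k][t] * coeffs[k][j] for k in range(n_coeffs)) for j in range(dim)]
            for t in range(n_frames)]


def timestep_token(step: int, d_model: int) -> List[float]:
    """Sinusoidal embedding of the diffusion step (assumes d_model is even)."""
    half = d_model // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(step * f) for f in freqs] + [math.cos(step * f) for f in freqs]


def build_tokens(obs_coeffs: Matrix, noisy_coeffs: Matrix, step: int,
                 project: Callable[[List[float]], List[float]]) -> Matrix:
    """Hypothetical 'all inputs are tokens' layout: [step | observed | noisy future].

    The condition (observed motion) enters the sequence like any other token,
    so no cross-attention or adaptive layer norm is needed to inject it.
    """
    d_model = len(project(obs_coeffs[0]))
    return ([timestep_token(step, d_model)]
            + [project(row) for row in obs_coeffs]
            + [project(row) for row in noisy_coeffs])


if __name__ == "__main__":
    T = 20  # toy sequence: 20 frames of a single 3-D joint
    motion = [[math.sin(0.1 * t), math.cos(0.1 * t), 0.01 * t] for t in range(T)]
    coeffs = to_frequency(motion, 5)  # low-frequency summary of the sequence
    approx = from_frequency(coeffs, T)
    err = max(abs(a - b) for ra, rb in zip(motion, approx) for a, b in zip(ra, rb))
    print(f"max reconstruction error with 5 of {T} coefficients: {err:.4f}")
```

Smooth human motion concentrates most of its energy in the low-frequency coefficients, which is one common motivation for working in DCT space; keeping all `T` coefficients reconstructs the sequence exactly. Putting the condition in the token sequence lets plain self-attention relate observed and future frames, which is the kind of design the abstract credits for the lighter model.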

updated: Sun Jul 30 2023 01:52:07 GMT+0000 (UTC)

published: Sun Jul 30 2023 01:52:07 GMT+0000 (UTC)

arXiv
