OhMG: Zero-shot Open-vocabulary Human Motion Generation

Junfan Lin; Jianlong Chang; Lingbo Liu; Guanbin Li; Liang Lin; Qi Tian; Chang-wen Chen

Generating motion from text has attracted increasing attention. However, open-vocabulary human motion generation remains largely unexplored and suffers from a lack of diverse labeled data. Fortunately, recent studies of large multi-modal foundation models (e.g., CLIP) have demonstrated superior performance on few/zero-shot image-text alignment, largely reducing the need for manually labeled data. In this paper, we take advantage of CLIP for open-vocabulary 3D human motion generation in a zero-shot manner. Specifically, our model is composed of two stages: text2pose and pose2motion. For text2pose, to address the difficulty of optimizing under direct supervision from CLIP, we propose to carve the versatile CLIP model into a slimmer but more specific model for aligning 3D poses and texts, via a novel pipeline distillation strategy. By optimizing with the distilled 3D pose-text model, we transfer the text-pose knowledge of CLIP into a text2pose generator effectively and efficiently. As for pose2motion, drawing inspiration from advanced language models, we pretrain a transformer-based motion model, which compensates for CLIP's lack of motion dynamics. Then, by formulating the poses generated in the text2pose stage as prompts, the motion generator can generate motions that refer to those poses in a controllable and flexible manner. Our method is validated against advanced baselines and obtains substantial improvements. The code will be released here.
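The two-stage structure described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: every name below (`encode_pose`, `text2pose`, `pose2motion`) is hypothetical, a linear map stands in for the distilled 3D pose-text model, finite-difference gradient ascent stands in for the actual optimization, and linear interpolation stands in for the pretrained transformer-based motion model. It only shows the data flow: a text embedding guides the optimization of a pose, and the resulting poses act as prompts that the motion stage fills in.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)


def encode_pose(pose, weights):
    """Toy linear stand-in (assumption) for a distilled pose encoder."""
    return [sum(w * p for w, p in zip(row, pose)) for row in weights]


def text2pose(text_emb, weights, steps=200, lr=0.1, eps=1e-4, seed=0):
    """Stage 1 (sketch): optimize a pose so its embedding aligns with the text.

    Uses finite-difference gradient ascent on cosine similarity; a real
    system would train a generator with automatic differentiation.
    """
    rng = random.Random(seed)
    pose = [rng.uniform(-1.0, 1.0) for _ in range(len(weights[0]))]
    for _ in range(steps):
        base = cosine(encode_pose(pose, weights), text_emb)
        grad = []
        for i in range(len(pose)):
            bumped = pose[:]
            bumped[i] += eps
            score = cosine(encode_pose(bumped, weights), text_emb)
            grad.append((score - base) / eps)
        pose = [p + lr * g for p, g in zip(pose, grad)]
    return pose


def pose2motion(key_poses, frames_between=4):
    """Stage 2 (sketch): treat poses as prompts and fill in the frames between.

    Linear interpolation is a placeholder (assumption) for a pretrained
    motion model that would supply realistic dynamics.
    """
    motion = []
    for start, end in zip(key_poses, key_poses[1:]):
        for k in range(frames_between):
            t = k / frames_between
            motion.append([(1 - t) * s + t * e for s, e in zip(start, end)])
    motion.append(list(key_poses[-1]))
    return motion


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    # Two made-up "text embeddings", one per key pose in the sequence.
    prompts = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
    key_poses = [text2pose(emb, identity) for emb in prompts]
    motion = pose2motion(key_poses, frames_between=4)
    print(len(motion), "frames generated")
```

Separating the stages this way mirrors the motivation given in the abstract: the text-aligned model only has to judge static poses, while temporal dynamics come from a separate motion model.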

updated: Fri Oct 28 2022 06:20:55 GMT+0000 (UTC)

published: Fri Oct 28 2022 06:20:55 GMT+0000 (UTC)

arXiv
