Being Comes from Not-being: Open-vocabulary Text-to-Motion Generation with Wordless Training

Junfan Lin; Jianlong Chang; Lingbo Liu; Guanbin Li; Liang Lin; Qi Tian; Chang-wen Chen

Text-to-motion generation is an emerging and challenging problem that aims to synthesize motion with the same semantics as the input text. However, due to the lack of diverse labeled training data, most approaches are either limited to specific types of text annotations or require online optimization to accommodate the texts during inference, at the cost of efficiency and stability. In this paper, we investigate offline open-vocabulary text-to-motion generation in a zero-shot learning manner that requires neither paired training data nor extra online optimization to adapt to unseen texts. Inspired by prompt learning in NLP, we pretrain a motion generator that learns to reconstruct the full motion from a masked motion. During inference, instead of changing the motion generator, our method reformulates the input text into a masked motion that serves as the prompt for the motion generator to "reconstruct" the motion. In constructing the prompt, the unmasked poses are synthesized by a text-to-pose generator. To supervise the optimization of the text-to-pose generator, we propose the first text-pose alignment model for measuring the alignment between texts and 3D poses. To prevent the pose generator from overfitting to limited training texts, we further propose a novel wordless training mechanism that optimizes the text-to-pose generator without any training texts. Comprehensive experimental results show that our method obtains a significant improvement over the baseline methods. The code is available.
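The idea of a masked motion used as a prompt can be pictured with a small sketch. It is an illustration only and not the authors' implementation. The function names, the array layout (frames by pose dimensions), and the choice of key frames are all assumptions, and linear interpolation stands in for the pretrained motion generator, which in the paper is a learned model. The text-to-pose generator is not modeled here; its output is represented by hand-written key poses.

```python
import numpy as np

def build_masked_motion_prompt(key_poses, key_frames, num_frames):
    """Place given key poses at chosen frames and mark all other frames as masked.

    Hypothetical helper: in the paper the key poses come from a text-to-pose
    generator; here they are simply passed in.
    """
    key_poses = np.asarray(key_poses, dtype=np.float32)
    pose_dim = key_poses.shape[1]
    motion = np.zeros((num_frames, pose_dim), dtype=np.float32)
    visible = np.zeros(num_frames, dtype=bool)
    motion[key_frames] = key_poses
    visible[key_frames] = True
    return motion, visible

def linear_infill(motion, visible):
    """Stand-in for the pretrained motion generator (assumption).

    Fills masked frames by linear interpolation between visible frames,
    one pose dimension at a time.
    """
    frames = np.arange(motion.shape[0])
    known = frames[visible]
    out = np.empty_like(motion)
    for d in range(motion.shape[1]):
        out[:, d] = np.interp(frames, known, motion[known, d])
    return out

# Two toy 2-D "poses" placed at the first and last of five frames.
prompt, visible = build_masked_motion_prompt([[0.0, 0.0], [4.0, 8.0]], [0, 4], 5)
full_motion = linear_infill(prompt, visible)
```

The sketch keeps the visible poses unchanged and only fills the masked frames, which mirrors the reconstruction framing described in the abstract. A learned generator would replace `linear_infill` without changing how the prompt is built.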

updated: Mon Mar 20 2023 04:36:45 GMT+0000 (UTC)

published: Fri Oct 28 2022 06:20:55 GMT+0000 (UTC)

arXiv
