CLIP-Actor: Text-Driven Recommendation and Stylization for Animating Human Meshes

Kim Youwang; Kim Ji-Yeon; Tae-Hyun Oh

We propose CLIP-Actor, a text-driven motion recommendation and neural mesh stylization system for human mesh animation. CLIP-Actor animates a 3D human mesh to conform to a text prompt by recommending a motion sequence and learning mesh style attributes. Prior work fails to generate plausible results when the artist-designed mesh content does not conform to the text from the outset. Instead, we build a text-driven human motion recommendation system by leveraging a large-scale human motion dataset with language labels. Given a natural language prompt, CLIP-Actor first suggests a human motion that conforms to the prompt in a coarse-to-fine manner. Then, we propose a synthesize-through-optimization method that adds geometric detail and texture to the recommended mesh sequence in a way that is disentangled from the pose of each frame. This allows the style attributes to conform to the prompt in a temporally consistent and pose-agnostic manner. The decoupled neural optimization also enables spatio-temporal view augmentation from multi-frame human motion. We further propose mask-weighted embedding attention, which stabilizes the optimization process by rejecting distracting renders containing scarce foreground pixels. We demonstrate that CLIP-Actor produces plausible, human-recognizable stylized 3D human meshes in motion, with detailed geometry and texture, from a natural language prompt.
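
The idea of down-weighting renders with scarce foreground pixels can be illustrated with a minimal sketch. The abstract does not give the actual formulation, so everything here is an assumption made for illustration: the weights are taken to be each render's foreground pixel ratio, renders below a hypothetical `min_ratio` threshold are rejected outright, and the function names `foreground_ratio` and `mask_weighted_score` are invented. The per-render similarities stand in for text-image scores that would come from an embedding model.

```python
from typing import Sequence


def foreground_ratio(mask: Sequence[Sequence[int]]) -> float:
    """Fraction of pixels in a binary mask (1 = foreground) that are foreground."""
    total = sum(len(row) for row in mask)
    if total == 0:
        return 0.0
    return sum(sum(row) for row in mask) / total


def mask_weighted_score(
    similarities: Sequence[float],
    masks: Sequence[Sequence[Sequence[int]]],
    min_ratio: float = 0.05,  # hypothetical rejection threshold (assumption)
) -> float:
    """Weighted mean of per-render similarities, weighted by foreground ratio.

    Renders whose foreground ratio falls below `min_ratio` get weight 0,
    so they do not influence the aggregated score.
    """
    if len(similarities) != len(masks):
        raise ValueError("need exactly one mask per similarity score")
    ratios = [foreground_ratio(m) for m in masks]
    weights = [r if r >= min_ratio else 0.0 for r in ratios]
    total = sum(weights)
    if total == 0.0:
        return 0.0  # every render was rejected
    return sum(w * s for w, s in zip(weights, similarities)) / total


if __name__ == "__main__":
    full = [[1, 1], [1, 1]]   # subject fills the render
    half = [[1, 0], [1, 0]]   # subject covers half the render
    empty = [[0, 0], [0, 0]]  # subject is out of view
    print(mask_weighted_score([0.8, 0.4, -1.0], [full, half, empty]))
```

In this toy setup the empty render, despite its very low similarity, does not affect the aggregate, which is the stabilizing effect the abstract describes. A proportional weighting is only one possible choice; a softmax over the ratios would be another.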

updated: Thu Jun 09 2022 09:50:39 GMT+0000 (UTC)

published: Thu Jun 09 2022 09:50:39 GMT+0000 (UTC)

arXiv
