Language-free Compositional Action Generation via Decoupling Refinement

Xiao Liu; Guangyi Chen; Yansong Tang; Guangrun Wang; Xiao-Ping Zhang; Ser-Nam Lim

Composing simple elements into complex concepts is crucial yet challenging, especially for 3D action generation. Existing methods largely rely on extensive natural language annotations to discern composable latent semantics, a process that is often costly and labor-intensive. In this study, we introduce a novel framework that generates compositional actions without relying on language auxiliaries. Our approach consists of three main components: Action Coupling, Conditional Action Generation, and Decoupling Refinement. Action Coupling uses an energy model to extract an attention mask for each sub-action, then integrates two actions using these masks to generate pseudo-training examples. Next, we employ a conditional generative model, CVAE, to learn a latent space that facilitates diverse generation. Finally, we propose Decoupling Refinement, which leverages a self-supervised pre-trained model, MAE, to ensure semantic consistency between the sub-actions and the compositional actions. This refinement process involves four steps: rendering the generated 3D actions into 2D space, decoupling the rendered images into two sub-segments, using the MAE model to restore the complete image from each sub-segment, and constraining the recovered images to match the images rendered from the raw sub-actions. Because no existing dataset contains both sub-actions and compositional actions, we created two new datasets, HumanAct-C and UESTC-C, and propose a corresponding evaluation metric. Both qualitative and quantitative assessments are conducted to demonstrate the efficacy of our approach.
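As a rough illustration of the Action Coupling step described above, the following sketch blends two skeleton sequences joint by joint using an attention mask. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `couple_actions`, the per-joint mask layout, and the linear blending rule are all hypothetical, and the mask is assumed to be given rather than extracted by an energy model.

```python
from typing import List, Sequence

Pose = List[List[float]]   # one frame: J joints, each [x, y, z]
Action = List[Pose]        # T frames


def couple_actions(action_a: Action, action_b: Action,
                   mask_a: Sequence[float]) -> Action:
    """Blend two equal-length actions joint by joint (hypothetical helper).

    mask_a[j] in [0, 1] is an assumed, precomputed attention weight of
    joint j for action_a; action_b receives the weight 1 - mask_a[j].
    """
    if len(action_a) != len(action_b):
        raise ValueError("actions must have the same number of frames")
    coupled: Action = []
    for frame_a, frame_b in zip(action_a, action_b):
        if not (len(frame_a) == len(frame_b) == len(mask_a)):
            raise ValueError("joint count must match the mask length")
        # Weighted average of the two poses, one weight per joint.
        coupled.append([
            [w * ca + (1.0 - w) * cb for ca, cb in zip(joint_a, joint_b)]
            for w, joint_a, joint_b in zip(mask_a, frame_a, frame_b)
        ])
    return coupled


# Toy example: one frame, two joints. Joint 0 is taken from action A,
# joint 1 from action B.
wave = [[[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]]]
walk = [[[0.0, 0.0, 0.0], [4.0, 4.0, 4.0]]]
pseudo_sample = couple_actions(wave, walk, mask_a=[1.0, 0.0])
```

With a binary mask, each joint is copied from exactly one sub-action; fractional weights would interpolate between the two poses instead.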

updated: Mon Jan 08 2024 14:54:49 GMT+0000 (UTC)

published: Fri Jul 07 2023 12:00:38 GMT+0000 (UTC)

arXiv
