LARNet: Latent Action Representation for Human Action Synthesis

Naman Biyani; Aayush J Rana; Shruti Vyas; Yogesh S Rawat


We present LARNet, a novel end-to-end approach for generating human action videos. Joint generative modeling of appearance and dynamics for video synthesis is very challenging, and recent work in video synthesis has therefore proposed decomposing these two factors. However, these methods require a driving video to model the video dynamics. In this work, we instead propose a generative approach that explicitly learns action dynamics in latent space, avoiding the need for a driving video during inference. The generated action dynamics are integrated with the appearance using a recurrent hierarchical structure, which induces motion at different scales to focus on both coarse and fine-level action details. In addition, we propose a novel mix-adversarial loss function that aims to improve the temporal coherency of synthesized videos. We evaluate the proposed approach on four real-world human action datasets and demonstrate its effectiveness in generating human actions. The code and models will be made publicly available.

updated: Thu Oct 21 2021 05:04:32 GMT+0000 (UTC)

published: Thu Oct 21 2021 05:04:32 GMT+0000 (UTC)

arXiv
