Pose-guided Generative Adversarial Net for Novel View Action Synthesis

Xianhang Li; Junhao Zhang; Kunchang Li; Shruti Vyas; Yogesh S Rawat

We focus on the problem of novel-view human action synthesis. Given an action video, the goal is to generate the same action from an unseen viewpoint. Novel-view video synthesis is more challenging than image synthesis because it requires synthesizing a sequence of realistic frames with temporal coherence. In addition, transferring different actions to a novel target view requires simultaneous awareness of the action category and the viewpoint change. To address these challenges, we propose a novel framework named Pose-guided Action Separable Generative Adversarial Net (PAS-GAN), which uses pose to alleviate the difficulty of this task. First, we propose a recurrent pose-transformation module that transforms actions from the source view to the target view and generates a novel-view pose sequence in 2D coordinate space. Second, a well-transformed pose sequence enables us to separate the action and background in the target view. We employ a novel local-global spatial transformation module to effectively generate sequential video features in the target view from these action and background features. Finally, the generated video features are used to synthesize human action with the help of a 3D decoder. Moreover, to focus on dynamic action in the video, we propose a novel multi-scale action-separable loss that further improves video quality. We conduct extensive experiments on two large-scale multi-view human action datasets, NTU-RGBD and PKU-MMD, and demonstrate that PAS-GAN outperforms existing approaches.
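
The abstract does not give the formula for the multi-scale action-separable loss, so the following is only a rough sketch of one plausible reading, not the authors' implementation. It assumes a pose-derived mask marks action pixels (1.0) versus background (0.0), that the error is a masked L1 distance, and that "multi-scale" means repeating the comparison on average-pooled frames. The names `downsample`, `masked_l1`, `action_separable_loss`, and the weights `action_weight` and `background_weight` are hypothetical.

```python
from typing import List, Sequence

Frame = List[List[float]]  # H x W grayscale frame (assumed representation)
Video = List[Frame]        # T frames


def downsample(frame: Frame, factor: int) -> Frame:
    """Average-pool a frame by an integer factor (H and W must be divisible by it)."""
    h, w = len(frame), len(frame[0])
    area = factor * factor
    return [
        [
            sum(frame[y + dy][x + dx] for dy in range(factor) for dx in range(factor)) / area
            for x in range(0, w, factor)
        ]
        for y in range(0, h, factor)
    ]


def masked_l1(pred: Frame, target: Frame, mask: Frame) -> float:
    """Mask-weighted mean absolute error; returns 0.0 when the mask is empty."""
    num = 0.0
    den = 0.0
    for p_row, t_row, m_row in zip(pred, target, mask):
        for p, t, m in zip(p_row, t_row, m_row):
            num += m * abs(p - t)
            den += m
    return num / den if den > 0 else 0.0


def action_separable_loss(
    pred: Video,
    target: Video,
    masks: Video,
    scales: Sequence[int] = (1, 2),
    action_weight: float = 2.0,      # hypothetical: emphasize the moving person
    background_weight: float = 1.0,  # hypothetical: background counts less
) -> float:
    """Average over scales and frames of separately weighted action/background errors."""
    total = 0.0
    for factor in scales:
        for p, t, m in zip(pred, target, masks):
            p_s, t_s, m_s = (downsample(f, factor) for f in (p, t, m))
            background = [[1.0 - v for v in row] for row in m_s]
            total += action_weight * masked_l1(p_s, t_s, m_s)
            total += background_weight * masked_l1(p_s, t_s, background)
    return total / (len(scales) * len(pred))


# Toy usage: one 2x2 frame whose top-left pixel is the action region.
target = [[[0.0, 0.0], [0.0, 0.0]]]
masks = [[[1.0, 0.0], [0.0, 0.0]]]
error_on_action = [[[1.0, 0.0], [0.0, 0.0]]]
error_on_background = [[[0.0, 1.0], [0.0, 0.0]]]
loss_action = action_separable_loss(error_on_action, target, masks)
loss_background = action_separable_loss(error_on_background, target, masks)
```

In this toy setup the same one-pixel error is penalized more when it falls inside the action mask than when it falls on the background, which is the behavior the abstract attributes to focusing the loss on dynamic action.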

updated: Fri Oct 15 2021 10:33:09 GMT+0000 (UTC)

published: Fri Oct 15 2021 10:33:09 GMT+0000 (UTC)

arXiv
