Neural Rendering of Humans in Novel View and Pose from Monocular Video

Tiantian Wang; Nikolaos Sarafianos; Ming-Hsuan Yang; Tony Tung

We introduce a new method that generates photo-realistic humans under novel views and poses given a monocular video as input. Despite significant recent progress on this topic, with several methods exploring shared canonical neural radiance fields in dynamic scenes, learning a user-controlled model for unseen poses remains a challenging task. To tackle this problem, we introduce an effective method to a) integrate observations across several frames and b) encode the appearance at each individual frame. We accomplish this by taking as input both the human pose, which models the body shape, and point clouds that partially cover the human. Our approach simultaneously learns a shared set of latent codes anchored to the human pose across several frames, and an appearance-dependent code anchored to incomplete point clouds generated from each frame and its predicted depth. The former, pose-based code models the shape of the performer, whereas the latter, point cloud-based code predicts fine-level details and reasons about missing structures at unseen poses. To further recover non-visible regions in query frames, we employ a temporal transformer to integrate features of points in query frames with tracked body points from automatically selected key frames. Experiments on various sequences of dynamic humans from different datasets, including ZJU-MoCap, show that our method significantly outperforms existing approaches under unseen poses and novel views given monocular videos as input.
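
The abstract mentions incomplete point clouds generated from each frame and its predicted depth. As a rough illustration of that single step, the sketch below lifts a depth map to camera-space 3D points under a standard pinhole camera model. This is not the authors' implementation: the function name `backproject_depth`, the intrinsics `fx`, `fy`, `cx`, `cy`, and the optional foreground `mask` are all assumptions made for the example, and the abstract does not say how depth is predicted or how the human is segmented.

```python
import numpy as np


def backproject_depth(depth, fx, fy, cx, cy, mask=None):
    """Lift a depth map of shape (H, W) to camera-space points of shape (N, 3).

    Hypothetical helper: assumes a pinhole camera with focal lengths (fx, fy)
    and principal point (cx, cy), and that depth <= 0 marks missing pixels.
    """
    h, w = depth.shape
    # Pixel grid: u indexes columns (image x), v indexes rows (image y).
    u, v = np.meshgrid(np.arange(w), np.arange(h))

    # Keep only pixels with a valid depth, optionally restricted to a mask
    # (for example, a foreground segmentation of the person).
    valid = depth > 0
    if mask is not None:
        valid &= mask.astype(bool)

    z = depth[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    return np.stack([x, y, z], axis=-1)


# Toy usage: a 2x3 depth map with a single valid pixel at row 1, column 2.
depth = np.zeros((2, 3))
depth[1, 2] = 2.0
points = backproject_depth(depth, fx=1.0, fy=1.0, cx=1.0, cy=0.0)
```

Because only pixels visible in one frame contribute, the resulting cloud covers the body only partially, which is consistent with the abstract's description of the point clouds as incomplete.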

updated: Thu Apr 20 2023 04:08:04 GMT+0000 (UTC)

published: Mon Apr 04 2022 03:09:20 GMT+0000 (UTC)

arXiv
