SHERF: Generalizable Human NeRF from a Single Image

Shoukang Hu; Fangzhou Hong; Liang Pan; Haiyi Mei; Lei Yang; Ziwei Liu

Existing Human NeRF methods for reconstructing 3D humans typically rely on multiple 2D images from multi-view cameras or monocular videos captured from fixed camera views. However, in real-world scenarios, human images are often captured from random camera angles, presenting challenges for high-quality 3D human reconstruction. In this paper, we propose SHERF, the first generalizable Human NeRF model for recovering animatable 3D humans from a single input image. SHERF extracts and encodes 3D human representations in canonical space, enabling rendering and animation from free views and poses. To achieve high-fidelity novel view and pose synthesis, the encoded 3D human representations should capture both global appearance and local fine-grained textures. To this end, we propose a bank of 3D-aware hierarchical features, including global, point-level, and pixel-aligned features, to facilitate informative encoding. Global features enhance the information extracted from the single input image and complement the information missing from the partial 2D observation. Point-level features provide strong clues of 3D human structure, while pixel-aligned features preserve more fine-grained details. To effectively integrate the 3D-aware hierarchical feature bank, we design a feature fusion transformer. Extensive experiments on THuman, RenderPeople, ZJU_MoCap, and HuMMan datasets demonstrate that SHERF achieves state-of-the-art performance, with better generalizability for novel view and pose synthesis.
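The pixel-aligned features mentioned above can be illustrated with a minimal sketch: a 3D point is projected into the input image and the 2D feature map is sampled bilinearly at that location. This is an illustration under stated assumptions, not the SHERF implementation. The pinhole camera model, the nested-list feature map layout, and the function names `project_point`, `bilinear_sample`, and `pixel_aligned_feature` are all hypothetical; the abstract does not specify these details, and the global features, point-level features, and feature fusion transformer are not modeled here.

```python
from typing import List, Sequence, Tuple

# Assumed layout: feature_map[row][col] is a list of channel values.
FeatureMap = List[List[List[float]]]


def project_point(point: Sequence[float], focal: float,
                  cx: float, cy: float) -> Tuple[float, float]:
    """Project a camera-space 3D point to pixel coordinates (assumed pinhole model)."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    return focal * x / z + cx, focal * y / z + cy


def bilinear_sample(feature_map: FeatureMap, u: float, v: float) -> List[float]:
    """Bilinearly interpolate the feature map at continuous pixel location (u, v)."""
    height, width = len(feature_map), len(feature_map[0])
    # Clamp to the image so points projecting outside reuse border features.
    u = min(max(u, 0.0), width - 1.0)
    v = min(max(v, 0.0), height - 1.0)
    x0, y0 = int(u), int(v)
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    ax, ay = u - x0, v - y0
    sampled = []
    for c in range(len(feature_map[0][0])):
        top = (1 - ax) * feature_map[y0][x0][c] + ax * feature_map[y0][x1][c]
        bottom = (1 - ax) * feature_map[y1][x0][c] + ax * feature_map[y1][x1][c]
        sampled.append((1 - ay) * top + ay * bottom)
    return sampled


def pixel_aligned_feature(point: Sequence[float], feature_map: FeatureMap,
                          focal: float, cx: float, cy: float) -> List[float]:
    """Return the image feature aligned with the projection of a 3D point."""
    u, v = project_point(point, focal, cx, cy)
    return bilinear_sample(feature_map, u, v)


if __name__ == "__main__":
    # Toy 2x2 feature map with a single channel.
    toy_map = [[[0.0], [1.0]], [[2.0], [3.0]]]
    print(pixel_aligned_feature((0.0, 0.0, 1.0), toy_map, focal=1.0, cx=0.5, cy=0.5))
```

In a full model, a feature sampled this way would be one of several inputs per query point. The abstract states that SHERF combines it with global and point-level features through a feature fusion transformer.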

updated: Wed Aug 16 2023 17:58:35 GMT+0000 (UTC)

published: Wed Mar 22 2023 17:59:12 GMT+0000 (UTC)

arXiv

