Stereo Radiance Fields (SRF): Learning View Synthesis for Sparse Views of Novel Scenes

Julian Chibane; Aayush Bansal; Verica Lazova; Gerard Pons-Moll

Recent neural view synthesis methods have achieved impressive quality and realism, surpassing classical pipelines that rely on multi-view reconstruction. State-of-the-art methods, such as NeRF, are designed to learn a single scene with a neural network and require dense multi-view inputs. Testing on a new scene requires re-training from scratch, which takes 2-3 days. In this work, we introduce Stereo Radiance Fields (SRF), a neural view synthesis approach that is trained end-to-end, generalizes to new scenes, and requires only sparse views at test time. The core idea is a neural architecture inspired by classical multi-view stereo methods, which estimate surface points by finding similar image regions in stereo images. In SRF, we predict color and density for each 3D point given an encoding of its stereo correspondence in the input images. The encoding is implicitly learned by an ensemble of pair-wise similarities, emulating classical stereo. Experiments show that SRF learns structure instead of overfitting on a scene. We train on multiple scenes of the DTU dataset and generalize to new ones without re-training, requiring only 10 sparse and spread-out views as input. We show that 10-15 minutes of fine-tuning further improves the results, achieving significantly sharper, more detailed results than scene-specific models. The code, model, and videos are available at https://virtualhumans.mpi-inf.mpg.de/srf/.
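To make the stereo-correspondence idea above concrete, here is a minimal sketch of the classical intuition only: a 3D point is projected into each input view, a feature is sampled at each projection, and the features are compared pair-wise. It is not the SRF architecture, which learns both the features and the similarity functions. Every name here (`project`, `sample_feature`, `pairwise_similarity`, `correspondence_encoding`), the pinhole camera model, the use of raw RGB as the feature, and cosine similarity as the comparison are assumptions chosen for illustration.

```python
import numpy as np

def project(point, K, R, t):
    """Project a 3D world point to pixel coordinates (assumed pinhole camera)."""
    cam = R @ point + t          # world -> camera coordinates
    uv = K @ cam                 # camera -> homogeneous pixel coordinates
    return uv[:2] / uv[2]

def sample_feature(image, uv):
    """Nearest-neighbour lookup of a per-pixel feature (here: raw RGB, an assumption)."""
    h, w = image.shape[:2]
    x = int(np.clip(np.rint(uv[0]), 0, w - 1))
    y = int(np.clip(np.rint(uv[1]), 0, h - 1))
    return image[y, x].astype(np.float64)

def pairwise_similarity(features):
    """Cosine similarity between every pair of view features, shape (N, N)."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, 1e-8)
    return unit @ unit.T

def correspondence_encoding(point, images, cameras):
    """Stack the similarities of all distinct view pairs into one vector.

    cameras is a list of (K, R, t) tuples, one per image. In a learned
    model, a vector like this would be fed to a network that predicts
    color and density; that network is omitted here.
    """
    feats = np.stack([
        sample_feature(img, project(point, K, R, t))
        for img, (K, R, t) in zip(images, cameras)
    ])
    sim = pairwise_similarity(feats)
    upper = np.triu_indices(len(images), k=1)   # each unordered pair once
    return sim[upper]
```

The intuition is that a point on an opaque surface tends to look alike in the views that see it, so its pair-wise similarities are high, while a point in free space projects onto unrelated image regions. With 10 input views, this hypothetical encoding would have 45 entries, one per view pair.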

updated: Wed Apr 14 2021 15:38:57 GMT+0000 (UTC)

published: Wed Apr 14 2021 15:38:57 GMT+0000 (UTC)

arXiv
