ACORN: Adaptive Coordinate Networks for Neural Scene Representation

Julien N. P. Martel; David B. Lindell; Connor Z. Lin; Eric R. Chan; Marco Monteiro; Gordon Wetzstein

Neural representations have emerged as a new paradigm for applications in rendering, imaging, geometric modeling, and simulation. Compared to traditional representations such as meshes, point clouds, or volumes, they can be flexibly incorporated into differentiable learning-based pipelines. While recent improvements to neural representations now make it possible to represent signals with fine details at moderate resolutions (e.g., for images and 3D shapes), adequately representing large-scale or complex scenes has proven a challenge. Current neural representations fail to accurately represent images at resolutions greater than a megapixel or 3D scenes with more than a few hundred thousand polygons. Here, we introduce a new hybrid implicit-explicit network architecture and training strategy that adaptively allocates resources during training and inference based on the local complexity of a signal of interest. Our approach uses a multiscale block-coordinate decomposition, similar to a quadtree or octree, that is optimized during training. The network architecture operates in two stages. First, using the bulk of the network parameters, a coordinate encoder generates a feature grid for a block in a single forward pass. Then, hundreds or thousands of samples within each block can be efficiently evaluated using a lightweight feature decoder. With this hybrid implicit-explicit network architecture, we demonstrate the first experiments that fit gigapixel images to nearly 40 dB peak signal-to-noise ratio. Notably, this represents an increase in scale of over 1000x compared to the resolution of previously demonstrated image-fitting experiments. Moreover, our approach represents 3D shapes significantly faster and more accurately than previous techniques: it reduces training times from days to hours or minutes and memory requirements by over an order of magnitude.
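The two-stage evaluation described in the abstract can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation. The layer sizes, the names (make_mlp, encode_block, sample_features, decode_block), the (cx, cy, scale) block coordinate, the random untrained weights, and the use of bilinear interpolation to read the feature grid are all assumptions made for illustration. The point it shows is the cost structure: the large encoder runs once per block, and only the small decoder runs once per sample.

```python
import math
import random

random.seed(0)

def make_mlp(sizes):
    """Random (untrained) weights for a small MLP; sizes = [in, hidden..., out]."""
    return [
        ([[random.gauss(0, 1 / math.sqrt(n_in)) for _ in range(n_in)]
          for _ in range(n_out)],
         [0.0] * n_out)
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]

def run_mlp(layers, x):
    """Forward pass with ReLU on every layer except the last."""
    for i, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b
             for row, b in zip(weights, biases)]
        if i < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return x

GRID = 4      # feature grid is GRID x GRID per block (hypothetical size)
CHANNELS = 8  # feature channels per grid cell (hypothetical size)

# Stage 1 (assumed shape): a larger encoder maps a block coordinate to a feature grid.
encoder = make_mlp([3, 64, 64, GRID * GRID * CHANNELS])
# Stage 2 (assumed shape): a small decoder maps one feature vector to an RGB value.
decoder = make_mlp([CHANNELS, 16, 3])

def encode_block(cx, cy, scale):
    """One encoder pass; returns grid[row][col][channel]."""
    flat = run_mlp(encoder, [cx, cy, scale])
    return [[flat[(r * GRID + c) * CHANNELS:(r * GRID + c + 1) * CHANNELS]
             for c in range(GRID)] for r in range(GRID)]

def sample_features(grid, u, v):
    """Bilinear interpolation at local coordinates u, v in [0, 1] (an assumption)."""
    x, y = u * (GRID - 1), v * (GRID - 1)
    c0, r0 = min(int(x), GRID - 2), min(int(y), GRID - 2)
    fx, fy = x - c0, y - r0
    return [
        (1 - fy) * ((1 - fx) * grid[r0][c0][k] + fx * grid[r0][c0 + 1][k])
        + fy * ((1 - fx) * grid[r0 + 1][c0][k] + fx * grid[r0 + 1][c0 + 1][k])
        for k in range(CHANNELS)
    ]

def decode_block(cx, cy, scale, local_points):
    """Encoder runs once for the block; the decoder runs once per sample."""
    grid = encode_block(cx, cy, scale)
    return [run_mlp(decoder, sample_features(grid, u, v)) for u, v in local_points]

# Usage: evaluate 100 samples inside one block with a single encoder pass.
points = [(i / 9, j / 9) for i in range(10) for j in range(10)]
colors = decode_block(0.25, -0.5, 0.125, points)
```

In this sketch the encoder holds roughly 13,000 weights and the decoder fewer than 200, so adding more samples per block costs little. An adaptive decomposition would then decide how many blocks to spend on each region of the signal, which this sketch does not attempt to model.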

updated: Thu May 06 2021 16:21:38 GMT+0000 (UTC)

published: Thu May 06 2021 16:21:38 GMT+0000 (UTC)

arXiv
