Is Attention All That NeRF Needs?

Mukund Varma T; Peihao Wang; Xuxi Chen; Tianlong Chen; Subhashini Venugopalan; Zhangyang Wang


We present Generalizable NeRF Transformer (GNT), a transformer-based architecture that reconstructs Neural Radiance Fields (NeRFs) and learns to render novel views on the fly from source views. While prior works on NeRFs optimize a scene representation by inverting a handcrafted rendering equation, GNT achieves neural representation and rendering that generalizes across scenes by using transformers at two stages. (1) The view transformer leverages multi-view geometry as an inductive bias for attention-based scene representation, and predicts coordinate-aligned features by aggregating information from epipolar lines on the neighboring views. (2) The ray transformer renders novel views by using attention to decode the features from the view transformer along the points sampled during ray marching. Our experiments demonstrate that, when optimized on a single scene, GNT can successfully reconstruct NeRF without an explicit rendering formula, thanks to the learned ray renderer. When trained on multiple scenes, GNT consistently achieves state-of-the-art performance when transferring to unseen scenes and outperforms all other methods by ~10% on average. Our analysis of the learned attention maps, used to infer depth and occlusion, indicates that attention enables learning a physically grounded rendering. Our results show the promise of transformers as a universal modeling tool for graphics. Please refer to our project page for video results: https://vita-group.github.io/GNT/.
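The two-stage design described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names (`attend`, `render_ray`), the tensor shapes, the use of plain scaled dot-product attention, the mean-over-views query, and the mean-pool-plus-linear RGB head are all assumptions chosen only to show how attention over source views followed by attention along a ray could yield a pixel color without a handcrafted rendering equation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(query, keys, values):
    # Scaled dot-product attention.
    # query: (..., Q, D); keys, values: (..., K, D) -> output (..., Q, D).
    d = query.shape[-1]
    scores = query @ np.swapaxes(keys, -1, -2) / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ values, weights

def render_ray(epipolar_feats, w_rgb):
    # epipolar_feats: (P, V, D) hypothetical features for P points sampled
    # on one ray, each gathered from V source views at the projection of
    # that point. w_rgb: (D, 3) hypothetical linear color head.

    # Stage 1 (view attention): for each point, attend over the V views.
    # Using the mean over views as the query is an assumption of this sketch.
    query = epipolar_feats.mean(axis=1, keepdims=True)           # (P, 1, D)
    point_feats, _ = attend(query, epipolar_feats, epipolar_feats)
    point_feats = point_feats[:, 0, :]                           # (P, D)

    # Stage 2 (ray attention): self-attention across the P points on the ray.
    ray_feats, ray_weights = attend(point_feats, point_feats, point_feats)

    # Pool along the ray and map to a color in (0, 1) with a sigmoid.
    pooled = ray_feats.mean(axis=0)                              # (D,)
    rgb = 1.0 / (1.0 + np.exp(-(pooled @ w_rgb)))                # (3,)
    return rgb, ray_weights

rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 4, 16))   # 8 points, 4 source views, 16 channels
w_rgb = rng.normal(size=(16, 3))
rgb, ray_weights = render_ray(feats, w_rgb)
```

In this sketch, `ray_weights` is the per-ray attention map; the abstract's claim about inferring depth and occlusion refers to analyzing maps of this kind in the trained model, which random weights here do not reproduce.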

updated: Thu Mar 02 2023 04:54:00 GMT+0000 (UTC)

published: Wed Jul 27 2022 05:09:54 GMT+0000 (UTC)

arXiv
