Scene Representation Transformer: Geometry-Free Novel View Synthesis Through Set-Latent Scene Representations

Mehdi S. M. Sajjadi; Henning Meyer; Etienne Pot; Urs Bergmann; Klaus Greff; Noha Radwan; Suhani Vora; Mario Lucic; Daniel Duckworth; Alexey Dosovitskiy; Jakob Uszkoreit; Thomas Funkhouser; Andrea Tagliasacchi

A classical problem in computer vision is to infer a 3D scene representation from few images that can be used to render novel views at interactive rates. Previous work focuses on reconstructing pre-defined 3D representations, e.g. textured meshes, or implicit representations, e.g. radiance fields, and often requires input images with precise camera poses and long processing times for each novel scene. In this work, we propose the Scene Representation Transformer (SRT), a method which processes posed or unposed RGB images of a new area, infers a "set-latent scene representation", and synthesizes novel views, all in a single feed-forward pass. To calculate the scene representation, we propose a generalization of the Vision Transformer to sets of images, enabling global information integration, and hence 3D reasoning. An efficient decoder transformer parameterizes the light field by attending into the scene representation to render novel views. Learning is supervised end-to-end by minimizing a novel-view reconstruction error. We show that this method outperforms recent baselines in terms of PSNR and speed on synthetic datasets, including a new dataset created for the paper. Further, we demonstrate that SRT scales to support interactive visualization and semantic segmentation of real-world outdoor environments using Street View imagery.
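
The decoding idea in the abstract, where ray queries attend into an unordered set of scene tokens, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the single-head attention, the random weights, the dimensions, and the names `cross_attend`, `scene_tokens`, and `ray_queries` are all assumptions made for illustration. The sketch also includes the standard PSNR formula used as an evaluation metric.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(ray_queries, scene_tokens, w_q, w_k, w_v):
    """Single-head cross-attention: each ray query attends into the token set."""
    q = ray_queries @ w_q                               # (R, d)
    k = scene_tokens @ w_k                              # (N, d)
    v = scene_tokens @ w_v                              # (N, d)
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (R, N), rows sum to 1
    return weights @ v                                  # (R, d)

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images with values in [0, max_val]."""
    mse = np.mean((pred - target) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

rng = np.random.default_rng(0)
d_ray, d_tok, d = 6, 16, 8                      # hypothetical sizes
scene_tokens = rng.normal(size=(10, d_tok))     # stand-in for a set-latent scene representation
ray_queries = rng.normal(size=(4, d_ray))       # stand-in for ray encodings (origin + direction)
w_q = rng.normal(size=(d_ray, d))               # random, untrained projection weights
w_k = rng.normal(size=(d_tok, d))
w_v = rng.normal(size=(d_tok, d))

features = cross_attend(ray_queries, scene_tokens, w_q, w_k, w_v)

# Shuffling the tokens leaves the output unchanged (up to floating-point error),
# which is why the representation can be treated as a set.
shuffled = scene_tokens[rng.permutation(len(scene_tokens))]
features_shuffled = cross_attend(ray_queries, shuffled, w_q, w_k, w_v)
```

In a full model the attended features would be mapped to RGB values per ray and trained against the target view with a reconstruction loss; that part is omitted here.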

updated: Mon Nov 29 2021 09:54:01 GMT+0000 (UTC)

published: Thu Nov 25 2021 16:18:56 GMT+0000 (UTC)

arXiv
