SUDS: Scalable Urban Dynamic Scenes

Haithem Turki; Jason Y. Zhang; Francesco Ferroni; Deva Ramanan


We extend neural radiance fields (NeRFs) to dynamic large-scale urban scenes. Prior work tends to reconstruct single video clips of short durations (up to 10 seconds). Two reasons are that such methods (a) tend to scale linearly with the number of moving objects and input videos because a separate model is built for each and (b) tend to require supervision via 3D bounding boxes and panoptic labels, obtained manually or via category-specific models. As a step towards truly open-world reconstructions of dynamic cities, we introduce two key innovations: (a) we factorize the scene into three separate hash table data structures to efficiently encode static, dynamic, and far-field radiance fields, and (b) we make use of unlabeled target signals consisting of RGB images, sparse LiDAR, off-the-shelf self-supervised 2D descriptors, and most importantly, 2D optical flow. Operationalizing such inputs via photometric, geometric, and feature-metric reconstruction losses enables SUDS to decompose dynamic scenes into the static background, individual objects, and their motions. When combined with our multi-branch table representation, such reconstructions can be scaled to tens of thousands of objects across 1.2 million frames from 1700 videos spanning geospatial footprints of hundreds of kilometers, (to our knowledge) the largest dynamic NeRF built to date. We present qualitative initial results on a variety of tasks enabled by our representations, including novel-view synthesis of dynamic urban scenes, unsupervised 3D instance segmentation, and unsupervised 3D cuboid detection. To compare to prior work, we also evaluate on KITTI and Virtual KITTI 2, surpassing state-of-the-art methods that rely on ground truth 3D bounding box annotations while being 10x quicker to train.
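The three-branch factorization described above can be illustrated with a small compositing sketch. This is not the authors' implementation: it assumes each branch has already been queried (for example from its own hash table) and shows one common way to blend static and dynamic samples along a ray, letting a far-field color fill whatever transmittance remains. The function name `composite_ray` and all parameter names are hypothetical.

```python
import numpy as np

def composite_ray(sigma_static, rgb_static, sigma_dynamic, rgb_dynamic,
                  deltas, rgb_far):
    """Volume-render one ray from hypothetical per-branch outputs.

    sigma_*: (N,) non-negative densities at N samples along the ray
    rgb_*:   (N, 3) colors at those samples
    deltas:  (N,) spacing between consecutive samples
    rgb_far: (3,) far-field color, used for light that is never absorbed
    """
    # Assumption: static and dynamic densities add at each sample.
    sigma = sigma_static + sigma_dynamic
    # Density-weighted color blend; fall back to an even split where empty.
    w_static = np.where(sigma > 0, sigma_static / np.maximum(sigma, 1e-10), 0.5)
    rgb = w_static[:, None] * rgb_static + (1.0 - w_static[:, None]) * rgb_dynamic

    # Standard NeRF-style alpha compositing.
    alpha = 1.0 - np.exp(-sigma * deltas)
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = trans * alpha

    # Transmittance left after the last sample goes to the far-field branch.
    remaining = trans[-1] * (1.0 - alpha[-1])
    color = (weights[:, None] * rgb).sum(axis=0) + remaining * rgb_far
    return color, weights
```

In this sketch an empty ray returns the far-field color unchanged, while an opaque static sample hides everything behind it. How SUDS actually combines its branches may differ in detail.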

updated: Sat Mar 25 2023 18:55:09 GMT+0000 (UTC)

published: Sat Mar 25 2023 18:55:09 GMT+0000 (UTC)

arXiv

