Hypersim: A Photorealistic Synthetic Dataset for Holistic Indoor Scene Understanding

Mike Roberts; Nathan Paczan

For many fundamental scene understanding tasks, it is difficult or impossible to obtain per-pixel ground truth labels from real images. We address this challenge by introducing Hypersim, a photorealistic synthetic dataset for holistic indoor scene understanding. To create our dataset, we leverage a large repository of synthetic scenes created by professional artists, and we generate 77,400 images of 461 indoor scenes with detailed per-pixel labels and corresponding ground truth geometry. Our dataset: (1) relies exclusively on publicly available 3D assets; (2) includes complete scene geometry, material information, and lighting information for every scene; (3) includes dense per-pixel semantic instance segmentations for every image; and (4) factors every image into diffuse reflectance, diffuse illumination, and a non-diffuse residual term that captures view-dependent lighting effects. Together, these features make our dataset well-suited for geometric learning problems that require direct 3D supervision, multi-task learning problems that require reasoning jointly over multiple input and output modalities, and inverse rendering problems. We analyze our dataset at the level of scenes, objects, and pixels, and we analyze costs in terms of money, annotation effort, and computation time. Remarkably, we find that it is possible to generate our entire dataset from scratch, for roughly half the cost of training a state-of-the-art natural language processing model. All the code we used to generate our dataset will be made available online.
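The factorization in item (4) can be illustrated with a small sketch. The abstract names the three terms but does not say how they combine, so the code below assumes the common intrinsic-image form, image = diffuse reflectance × diffuse illumination + non-diffuse residual, applied per pixel and per channel. The function names `compose` and `residual_from` and the nested-list image layout are hypothetical, chosen for illustration only; they are not taken from the dataset's code.

```python
# Hypothetical sketch: the combination rule (reflectance * illumination + residual)
# and every name here are assumptions, not the dataset's actual API.
# An "image" is a list of rows; each row is a list of (r, g, b) tuples.

def compose(reflectance, illumination, residual):
    """Recombine the three factors into a final image, per pixel and per channel."""
    return [
        [
            tuple(r * s + n for r, s, n in zip(refl_px, illum_px, resid_px))
            for refl_px, illum_px, resid_px in zip(refl_row, illum_row, resid_row)
        ]
        for refl_row, illum_row, resid_row in zip(reflectance, illumination, residual)
    ]


def residual_from(image, reflectance, illumination):
    """Recover the non-diffuse residual as image - reflectance * illumination."""
    return [
        [
            tuple(c - r * s for c, r, s in zip(img_px, refl_px, illum_px))
            for img_px, refl_px, illum_px in zip(img_row, refl_row, illum_row)
        ]
        for img_row, refl_row, illum_row in zip(image, reflectance, illumination)
    ]


# A 1x2 toy image with made-up values.
reflectance = [[(0.5, 0.5, 0.5), (0.25, 0.25, 0.25)]]
illumination = [[(2.0, 2.0, 2.0), (4.0, 4.0, 4.0)]]
residual = [[(0.25, 0.25, 0.25), (0.0, 0.0, 0.0)]]

image = compose(reflectance, illumination, residual)
```

Under this assumed model, a purely diffuse pixel has a zero residual, and any view-dependent effect such as a specular highlight shows up only in the residual term.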

updated: Tue Nov 10 2020 08:05:56 GMT+0000 (UTC)

published: Wed Nov 04 2020 20:12:07 GMT+0000 (UTC)

arXiv
