Towards Visual Foundational Models of Physical Scenes

Chethan Parameshwara; Alessandro Achille; Matthew Trager; Xiaolong Li; Jiawei Mo; Ashwin Swaminathan; CJ Taylor; Dheera Venkatraman; Xiaohan Fei; Stefano Soatto

We describe a first step towards learning general-purpose visual representations of physical scenes using only image prediction as a training criterion. To do so, we first define "physical scene" and show that, even though different agents may maintain different representations of the same scene, the underlying physical scene that can be inferred is unique. We then show that NeRFs cannot represent the physical scene, as they lack extrapolation mechanisms. Such mechanisms, however, could be provided by Diffusion Models, at least in theory. To test this hypothesis empirically, NeRFs can be combined with Diffusion Models, a process we refer to as NeRF Diffusion, and the result used as an unsupervised representation of the physical scene. Our analysis is limited to visual data, without the external grounding mechanisms that independent sensory modalities can provide.

updated: Tue Jun 06 2023 14:45:44 GMT+0000 (UTC)

published: Tue Jun 06 2023 14:45:44 GMT+0000 (UTC)

arXiv
