3D-IntPhys: Towards More Generalized 3D-grounded Visual Intuitive Physics under Challenging Scenes

Haotian Xue; Antonio Torralba; Joshua B. Tenenbaum; Daniel LK Yamins; Yunzhu Li; Hsiao-Yu Tung

Given a visual scene, humans have strong intuitions about how the scene can evolve over time under given actions. This intuition, often termed visual intuitive physics, is a critical ability that allows us to make effective plans to manipulate the scene and achieve desired outcomes without relying on extensive trial and error. In this paper, we present a framework capable of learning 3D-grounded visual intuitive physics models from videos of complex scenes with fluids. Our method is composed of a conditional Neural Radiance Field (NeRF)-style visual frontend and a 3D point-based dynamics prediction backend, which together let us impose strong relational and structural inductive biases to capture the structure of the underlying environment. Unlike existing point-based intuitive dynamics works that rely on supervision from dense point trajectories provided by simulators, we relax the requirements and assume access only to multi-view RGB images and (imperfect) instance masks acquired using color priors. This enables the proposed model to handle scenarios where accurate point estimation and tracking are hard or impossible. We generate datasets in simulation covering three challenging scenarios that involve fluids, granular materials, and rigid objects. The datasets do not include any dense particle information, so most previous 3D-based intuitive physics pipelines can barely handle them. We show that our model can make long-horizon future predictions by learning from raw images and that it significantly outperforms models that do not employ an explicit 3D representation space. We also show that, once trained, our model achieves strong generalization in complex scenarios under extrapolation settings.

updated: Sat Apr 22 2023 19:28:49 GMT+0000 (UTC)

published: Sat Apr 22 2023 19:28:49 GMT+0000 (UTC)

arXiv
