Physion: Evaluating Physical Prediction from Vision in Humans and Machines

Daniel M. Bear; Elias Wang; Damian Mrowca; Felix J. Binder; Hsiao-Yu Fish Tung; R. T. Pramod; Cameron Holdaway; Sirui Tao; Kevin Smith; Fan-Yun Sun; Li Fei-Fei; Nancy Kanwisher; Joshua B. Tenenbaum; Daniel L. K. Yamins; Judith E. Fan

While current vision algorithms excel at many challenging tasks, it is unclear how well they understand the physical dynamics of real-world environments. Here we introduce Physion, a dataset and benchmark for rigorously evaluating the ability to predict how physical scenarios will evolve over time. Our dataset features realistic simulations of a wide range of physical phenomena, including rigid and soft-body collisions, stable multi-object configurations, rolling, sliding, and projectile motion, thus providing a more comprehensive challenge than previous benchmarks. We used Physion to benchmark a suite of models varying in their architecture, learning objective, input-output structure, and training data. In parallel, we obtained precise measurements of human prediction behavior on the same set of scenarios, allowing us to directly evaluate how well any model could approximate human behavior. We found that vision algorithms that learn object-centric representations generally outperform those that do not, yet still fall far short of human performance. On the other hand, graph neural networks with direct access to physical state information both perform substantially better and make predictions that are more similar to those made by humans. These results suggest that extracting physical representations of scenes is the main bottleneck to achieving human-level and human-like physical understanding in vision algorithms. We have publicly released all data and code to facilitate the use of Physion to benchmark additional models in a fully reproducible manner, enabling systematic evaluation of progress towards vision algorithms that understand physical environments as robustly as people do.

updated: Mon Jun 20 2022 14:27:21 GMT+0000 (UTC)

published: Tue Jun 15 2021 16:13:39 GMT+0000 (UTC)

arXiv
