Understanding Physical Dynamics with Counterfactual World Modeling

Rahul Venkatesh; Honglin Chen; Kevin Feigelis; Daniel M. Bear; Khaled Jedoui; Klemen Kotar; Felix Binder; Wanhee Lee; Sherry Liu; Kevin A. Smith; Judith E. Fan; Daniel L. K. Yamins

The ability to understand physical dynamics is critical for agents to act in the world. Here, we use Counterfactual World Modeling (CWM) to extract visual structures for dynamics understanding. CWM uses a temporally-factored masking policy for masked prediction of video data without annotations. This policy enables highly effective "counterfactual prompting" of the predictor, allowing a spectrum of visual structures to be extracted from a single pre-trained predictor without finetuning on annotated datasets. We demonstrate that these structures are useful for physical dynamics understanding, allowing CWM to achieve state-of-the-art performance on the Physion benchmark.

updated: Mon Jul 22 2024 07:51:25 GMT+0000 (UTC)

published: Mon Dec 11 2023 03:07:25 GMT+0000 (UTC)

arXiv
