PIP: Physical Interaction Prediction via Mental Simulation with Span Selection

Jiafei Duan; Samson Yu; Soujanya Poria; Bihan Wen; Cheston Tan

Accurate prediction of physical interaction outcomes is a crucial component of human intelligence and is important for the safe and efficient deployment of robots in the real world. Existing vision-based intuitive physics models learn to predict physical interaction outcomes, but they mostly focus on generating short sequences of future frames based on physical properties (e.g., mass, friction, and velocity) extracted from visual inputs or a latent space. Few intuitive physics models have been tested on long physical interaction sequences with multiple interactions among different objects. We hypothesize that selective temporal attention during approximate mental simulations helps humans predict physical interaction outcomes. With these motivations, we propose a novel scheme: Physical Interaction Prediction via Mental Simulation with Span Selection (PIP). It uses a deep generative model to approximate mental simulation by generating future frames of physical interactions, then applies selective temporal attention in the form of span selection to predict physical interaction outcomes. To evaluate our model, we further propose the large-scale SPACE+ dataset of synthetic videos with long sequences of three prime physical interactions in a 3D environment. Our experiments show that PIP outperforms humans, baseline models, and related intuitive physics models that utilize mental simulation. Furthermore, PIP's span selection module effectively identifies the frames indicating key physical interactions among objects, allowing for added interpretability.
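The abstract does not spell out how span selection is computed, so the following is only an illustrative sketch of one common way to select a span over a frame sequence, not the authors' implementation. It assumes that some upstream model (hypothetical here) has already produced a start score and an end score for every generated frame, and it picks the contiguous frame range with the highest combined score. The names `select_span`, `pool_span`, and `max_len` are assumptions introduced for this example.

```python
from typing import List, Tuple


def select_span(start_scores: List[float],
                end_scores: List[float],
                max_len: int) -> Tuple[int, int]:
    """Pick the frame span (s, e), inclusive, that maximizes
    start_scores[s] + end_scores[e] with s <= e < s + max_len.

    Assumption: one start score and one end score per frame,
    produced by a hypothetical upstream scoring model.
    """
    n = len(start_scores)
    if n == 0 or n != len(end_scores):
        raise ValueError("score lists must be non-empty and equal in length")
    if max_len < 1:
        raise ValueError("max_len must be at least 1")

    best_span = (0, 0)
    best_score = float("-inf")
    for s in range(n):
        # Only consider end frames at or after the start frame,
        # and within the maximum span length.
        for e in range(s, min(n, s + max_len)):
            score = start_scores[s] + end_scores[e]
            if score > best_score:
                best_score = score
                best_span = (s, e)
    return best_span


def pool_span(frame_features: List[List[float]],
              span: Tuple[int, int]) -> List[float]:
    """Average the feature vectors of the frames inside the span.

    Assumption: mean pooling stands in for whatever aggregation
    feeds the outcome predictor.
    """
    s, e = span
    selected = frame_features[s:e + 1]
    dim = len(selected[0])
    return [sum(f[d] for f in selected) / len(selected) for d in range(dim)]


# Toy example: four generated frames with made-up scores and features.
starts = [0.1, 0.9, 0.2, 0.0]
ends = [0.0, 0.1, 0.8, 0.3]
features = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]

span = select_span(starts, ends, max_len=3)
pooled = pool_span(features, span)
```

In this toy run the selected span covers the frames whose scores peak together, and only those frames contribute to the pooled feature. That is also where the interpretability mentioned in the abstract would come from: the selected indices can be inspected directly.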

updated: Sun Nov 28 2021 15:08:06 GMT+0000 (UTC)

published: Fri Sep 10 2021 06:11:29 GMT+0000 (UTC)

arXiv
