PIP: Physical Interaction Prediction via Mental Imagery with Span Selection

Jiafei Duan; Samson Yu; Soujanya Poria; Bihan Wen; Cheston Tan

To align advanced artificial intelligence (AI) with human values and promote safe AI, it is important for AI to predict the outcome of physical interactions. Although how humans predict the outcomes of physical interactions among real-world objects is still debated, several works attempt this task with cognitively inspired AI approaches. However, AI approaches that mimic the mental imagery humans use to predict physical interactions in the real world are still lacking. In this work, we propose a novel scheme, PIP: Physical Interaction Prediction via Mental Imagery with Span Selection. PIP uses a deep generative model to generate future frames of physical interactions among objects, then extracts the information crucial for predicting those interactions by focusing on salient frames via span selection. To evaluate our model, we propose SPACE+, a large-scale dataset of synthetic video frames covering three physical interaction events in a 3D environment. Our experiments show that PIP outperforms baselines and human performance in physical interaction prediction for both seen and unseen objects. Furthermore, PIP's span selection scheme can effectively identify the frames in which physical interactions among objects occur within the generated frames, which adds interpretability.
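The idea of focusing on a contiguous span of salient frames can be illustrated with a minimal sketch. It assumes, hypothetically, that each generated frame has already been given a start score and an end score, and it picks the span with the highest combined score, as is commonly done for span selection in question answering. The function name `select_span`, the per-frame scores, and the `max_len` limit are assumptions for illustration and are not taken from the paper.

```python
from typing import List, Tuple


def select_span(
    start_scores: List[float],
    end_scores: List[float],
    max_len: int,
) -> Tuple[int, int]:
    """Return the (start, end) frame indices with the highest combined score.

    Hypothetical helper: scores are assumed to come from some model that
    rates each generated frame as a likely span start or span end.
    The span is inclusive and contains at most `max_len` frames.
    """
    if not start_scores or len(start_scores) != len(end_scores):
        raise ValueError("score lists must be non-empty and of equal length")
    if max_len < 1:
        raise ValueError("max_len must be at least 1")

    n = len(start_scores)
    best_span = (0, 0)
    best_score = float("-inf")
    for s in range(n):
        # Only consider ends at or after the start, within the length limit.
        for e in range(s, min(n, s + max_len)):
            score = start_scores[s] + end_scores[e]
            if score > best_score:
                best_score = score
                best_span = (s, e)
    return best_span


# Toy scores for five generated frames (made-up numbers).
start = [0.1, 0.2, 2.0, 0.3, 0.1]
end = [0.0, 0.1, 0.5, 1.8, 0.2]
span = select_span(start, end, max_len=3)
```

With these toy scores the selected span covers frames 2 to 3, which is where a downstream predictor would look for the interaction in this illustration.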

updated: Fri Sep 10 2021 06:11:29 GMT+0000 (UTC)

published: Fri Sep 10 2021 06:11:29 GMT+0000 (UTC)

arXiv
