SlotFormer: Unsupervised Visual Dynamics Simulation with Object-Centric Models

Ziyi Wu; Nikita Dvornik; Klaus Greff; Thomas Kipf; Animesh Garg

Understanding dynamics from visual observations is a challenging problem that requires disentangling individual objects from the scene and learning their interactions. While recent object-centric models can successfully decompose a scene into objects, modeling their dynamics effectively remains a challenge. We address this problem by introducing SlotFormer, a Transformer-based autoregressive model that operates on learned object-centric representations. Given a video clip, our approach reasons over object features to model spatio-temporal relationships and predicts accurate future object states. In this paper, we successfully apply SlotFormer to video prediction on datasets with complex object interactions. Moreover, the unsupervised dynamics model of SlotFormer can be used to improve performance on supervised downstream tasks such as Visual Question Answering (VQA) and goal-conditioned planning. Compared to past work on dynamics modeling, our method achieves significantly better long-term synthesis of object dynamics while retaining high-quality visual generation. In addition, SlotFormer enables VQA models to reason about the future without object-level labels, even outperforming counterparts that use ground-truth annotations. Finally, we show that it can serve as a world model for model-based planning, where it is competitive with methods designed specifically for such tasks.
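
The autoregressive idea in the abstract (predict the next set of object slots from recent slots, then feed the prediction back in) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `rollout` and `constant_velocity`, the `window` parameter, and the array shapes are all assumptions made for illustration, and a toy constant-velocity rule stands in for the learned Transformer.

```python
import numpy as np

def rollout(burn_in_slots, predict_next, num_steps, window=6):
    """Autoregressively extend a sequence of per-frame object slots.

    burn_in_slots: array of shape (T, N, D), slots for T observed frames,
                   N objects, D features (hypothetical layout).
    predict_next:  callable mapping a (t, N, D) context to (N, D) next slots.
                   In the paper this role is played by a Transformer; here it
                   is any function with that signature.
    """
    history = list(burn_in_slots)
    for _ in range(num_steps):
        # Condition only on the most recent `window` frames.
        context = np.stack(history[-window:])
        # Append the prediction so later steps can condition on it.
        history.append(predict_next(context))
    # Return only the predicted frames, shape (num_steps, N, D).
    return np.stack(history[len(burn_in_slots):])

def constant_velocity(context):
    # Toy stand-in for the learned dynamics model (an assumption, not the
    # paper's method): extrapolate the last frame-to-frame change.
    return context[-1] + (context[-1] - context[-2])

# Three observed frames, two objects, four features; values 0, 1, 2 per frame.
slots = np.arange(3, dtype=float).reshape(3, 1, 1) * np.ones((3, 2, 4))
future = rollout(slots, constant_velocity, num_steps=5)
print(future.shape)
```

Because each predicted frame is appended to the history, errors can compound over long rollouts, which is why long-horizon quality is the property the abstract emphasizes.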

Updated: 2022-10-12 01:53:58 UTC

Published: 2022-10-12 01:53:58 UTC

arXiv
