Time-Conditioned Generative Modeling of Object-Centric Representations for Video Decomposition and Prediction

Chengmin Gao; Bin Li

When perceiving the world from multiple viewpoints, humans can reason about complete objects in a compositional manner, even when an object is completely occluded from some viewpoints. Humans can also imagine novel views after observing multiple viewpoints. Despite remarkable recent advances, multi-view object-centric learning still leaves some problems open: 1) the shapes of partially or completely occluded objects cannot be reconstructed well; 2) novel viewpoint prediction depends on expensive viewpoint annotations rather than implicit view rules. These limitations keep agents from performing like humans. In this paper, we introduce a time-conditioned generative model for videos. To reconstruct the complete shapes of objects accurately, we enhance the disentanglement between different latent representations: view latent representations are jointly inferred with a Transformer and then cooperate with a sequential extension of Slot Attention to learn object-centric representations. The model also gains a new ability: Gaussian processes are employed as priors over view latent variables, enabling generation and novel-view prediction without viewpoint annotations. Experiments on multiple specifically designed synthetic datasets show that the proposed model can 1) decompose videos, 2) reconstruct the complete shapes of objects, and 3) predict novel viewpoints without viewpoint annotations.
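
To make the Gaussian process prior mentioned in the abstract concrete, the following is a minimal sketch of drawing time-indexed view latents from a zero-mean GP. It is an illustration under stated assumptions, not the authors' implementation: the squared-exponential kernel, the latent dimension, the jitter term, and every function name (`rbf_kernel`, `cholesky`, `sample_view_latents`) are hypothetical choices made for this example. The idea it shows is that latents at nearby timestamps are correlated, so a view latent can be sampled for any timestamp without a viewpoint annotation.

```python
import math
import random


def rbf_kernel(times, length_scale=1.0, variance=1.0):
    """Squared-exponential covariance matrix over timestamps (assumed kernel)."""
    return [
        [variance * math.exp(-0.5 * ((a - b) / length_scale) ** 2) for b in times]
        for a in times
    ]


def cholesky(matrix, jitter=1e-6):
    """Lower-triangular factor L with L @ L.T == matrix + jitter * I."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                # Jitter keeps the diagonal term positive for near-singular kernels.
                lower[i][j] = math.sqrt(matrix[i][i] + jitter - partial)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower


def sample_view_latents(times, latent_dim=2, length_scale=1.0, seed=0):
    """Draw one latent trajectory: each dimension is an independent GP sample."""
    rng = random.Random(seed)
    lower = cholesky(rbf_kernel(times, length_scale))
    n = len(times)
    latents = [[0.0] * latent_dim for _ in range(n)]
    for d in range(latent_dim):
        noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for i in range(n):
            # z = L @ eps has covariance L @ L.T, i.e. the (jittered) kernel matrix.
            latents[i][d] = sum(lower[i][k] * noise[k] for k in range(i + 1))
    return latents


if __name__ == "__main__":
    frame_times = [0.0, 0.5, 1.0, 1.5, 2.0]
    for t, z in zip(frame_times, sample_view_latents(frame_times)):
        print(t, [round(v, 3) for v in z])
```

In a full model, such samples would be fed to a decoder together with object-centric latents. Conditioning the GP on latents inferred from observed frames, rather than sampling from the prior alone, is one plausible route to novel-view prediction, but the abstract does not specify that procedure.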

updated: Sat Jan 21 2023 13:39:39 GMT+0000 (UTC)

published: Sat Jan 21 2023 13:39:39 GMT+0000 (UTC)

arXiv
