Time-Conditioned Generative Modeling of Object-Centric Representations for Video Decomposition and Prediction

Chengmin Gao; Bin Li

When perceiving the world from multiple viewpoints, humans can reason about complete objects in a compositional manner, even when an object is completely occluded from certain viewpoints. Humans can also imagine novel views after observing multiple viewpoints. Despite remarkable recent advances, multi-view object-centric learning still leaves some problems unresolved: 1) the shapes of partially or completely occluded objects cannot be reconstructed well; 2) novel-viewpoint prediction depends on expensive viewpoint annotations rather than on implicit rules in view representations. In this paper, we introduce a time-conditioned generative model for videos. To reconstruct the complete shape of an object accurately, we enhance the disentanglement between the latent representations of objects and views: the latent representations of time-conditioned views are jointly inferred with a Transformer and then fed to a sequential extension of Slot Attention to learn object-centric representations. In addition, Gaussian processes are employed as priors over the view latent variables, enabling video generation and novel-view prediction without viewpoint annotations. Experiments on multiple datasets demonstrate that the proposed model can perform object-centric video decomposition, reconstruct the complete shapes of occluded objects, and make novel-view predictions.
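To make the idea of a Gaussian process prior over time-conditioned view latents more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a zero-mean GP with a squared-exponential kernel applied independently to each latent dimension; the function names, kernel choice, latent dimension, and lengthscale are all hypothetical and chosen only for illustration. Sampling from the prior corresponds to generating a smooth view trajectory, and the posterior mean at unseen timestamps corresponds to predicting view latents for novel views without viewpoint annotations.

```python
import numpy as np

def rbf_kernel(t1, t2, lengthscale=2.0, variance=1.0):
    """Squared-exponential kernel between two 1-D arrays of timestamps (assumed kernel)."""
    diff = t1[:, None] - t2[None, :]
    return variance * np.exp(-0.5 * (diff / lengthscale) ** 2)

def sample_view_latents(times, latent_dim=4, lengthscale=2.0, jitter=1e-6, seed=0):
    """Draw one trajectory of shape (T, latent_dim) from a zero-mean GP prior.

    Each latent dimension is treated as an independent GP over time (an assumption).
    """
    rng = np.random.default_rng(seed)
    K = rbf_kernel(times, times, lengthscale) + jitter * np.eye(len(times))
    L = np.linalg.cholesky(K)  # K = L @ L.T
    eps = rng.standard_normal((len(times), latent_dim))
    return L @ eps

def predict_view_latents(t_obs, z_obs, t_new, lengthscale=2.0, jitter=1e-6):
    """GP posterior mean of the view latents at new timestamps."""
    K_oo = rbf_kernel(t_obs, t_obs, lengthscale) + jitter * np.eye(len(t_obs))
    K_no = rbf_kernel(t_new, t_obs, lengthscale)
    return K_no @ np.linalg.solve(K_oo, z_obs)

# Usage: sample latents at 8 observed frames, then query 3 unseen timestamps.
t_obs = np.arange(8.0)
z_obs = sample_view_latents(t_obs)
t_new = np.array([0.5, 3.5, 6.5])
z_new = predict_view_latents(t_obs, z_obs, t_new)
```

In a full model, latents such as `z_new` would be passed to a decoder together with the object-centric representations to render the novel views; that part is outside the scope of this sketch.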

updated: Thu Oct 26 2023 10:07:02 GMT+0000 (UTC)

published: Sat Jan 21 2023 13:39:39 GMT+0000 (UTC)

arXiv
