Unsupervised object-centric video generation and decomposition in 3D

Paul Henderson; Christoph H. Lampert

A natural approach to generative modeling of videos is to represent them as a composition of moving objects. Recent works model a set of 2D sprites over a slowly varying background, but without considering the underlying 3D scene that gives rise to them. We instead propose to model a video as the view seen while moving through a scene with multiple 3D objects and a 3D background. Our model is trained from monocular videos without any supervision, yet learns to generate coherent 3D scenes containing several moving objects. We conduct detailed experiments on two datasets, going beyond the visual complexity supported by state-of-the-art generative approaches. We evaluate our method on depth prediction and 3D object detection, tasks that cannot be addressed by those earlier works, and show that it outperforms them even on 2D instance segmentation and tracking.

updated: Wed Mar 24 2021 19:11:43 GMT+0000 (UTC)

published: Tue Jul 07 2020 18:01:29 GMT+0000 (UTC)

arXiv
