Video Exploration via Video-Specific Autoencoders

Kevin Wang; Deva Ramanan; Aayush Bansal

We present simple video-specific autoencoders that enable human-controllable video exploration. This includes a wide variety of analytic tasks such as (but not limited to) spatial and temporal super-resolution, spatial and temporal editing, object removal, video textures, average video exploration, and correspondence estimation within and across videos. Prior work has looked at each of these problems independently and proposed different formulations. In this work, we observe that a simple autoencoder trained (from scratch) on multiple frames of a specific video enables one to perform a large variety of video processing and editing tasks. Our tasks are enabled by two key observations: (1) latent codes learned by the autoencoder capture spatial and temporal properties of that video and (2) autoencoders can project out-of-sample inputs onto the video-specific manifold. For example, (1) interpolating latent codes enables temporal super-resolution and user-controllable video textures; (2) manifold reprojection enables spatial super-resolution, object removal, and denoising without training for any of these tasks. Importantly, a two-dimensional visualization of latent codes via principal component analysis acts as a tool for users to both visualize and intuitively control video edits. Finally, we quantitatively contrast our approach with prior art and find that, without any supervision or task-specific knowledge, our approach can perform comparably to supervised approaches trained specifically for a task.
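The two observations above can be illustrated with a minimal sketch. It is not the authors' model: as an assumption made purely for illustration, it replaces the trained convolutional autoencoder with a linear autoencoder fit in closed form (PCA via SVD) on a synthetic "video" of flattened frames. All names (`frames`, `encode`, `decode`, `codes_2d`) and sizes are hypothetical. Because this stand-in is linear, latent interpolation reduces to blending neighboring frames; a nonlinear autoencoder would generally behave differently.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "video" (assumption): T flattened frames of D pixels that vary smoothly in time.
T, D, K = 40, 64, 4
t = np.linspace(0.0, 1.0, T)[:, None]
basis = rng.standard_normal((K, D))
frames = np.hstack([np.sin(2 * np.pi * t), np.cos(2 * np.pi * t), t, t**2]) @ basis

# Linear autoencoder fit to this one video only (closed form via SVD).
mean = frames.mean(axis=0)
_, _, vt = np.linalg.svd(frames - mean, full_matrices=False)
components = vt[:K]  # (K, D), orthonormal rows


def encode(x):
    """Map frame(s) to latent code(s)."""
    return (x - mean) @ components.T


def decode(z):
    """Map latent code(s) back to frame(s)."""
    return z @ components + mean


codes = encode(frames)  # (T, K)

# (1) Temporal super-resolution: decode the midpoint of consecutive latent codes.
mid_frames = decode(0.5 * (codes[:-1] + codes[1:]))  # (T - 1, D)

# (2) Manifold reprojection: pass a corrupted frame through the autoencoder.
noisy = frames[10] + 0.5 * rng.standard_normal(D)
reprojected = decode(encode(noisy))

# Two-dimensional view of the latent codes via PCA, usable as a navigation map.
centered = codes - codes.mean(axis=0)
_, _, axes = np.linalg.svd(centered, full_matrices=False)
codes_2d = centered @ axes[:2].T  # (T, 2)
```

In this toy setting the reprojected frame is closer to the clean frame than the noisy input, because the noise components outside the video-specific subspace are discarded. Plotting `codes_2d` gives one point per frame, the kind of map a user could click on to select or interpolate frames.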

updated: Wed Mar 31 2021 17:56:13 GMT+0000 (UTC)

published: Wed Mar 31 2021 17:56:13 GMT+0000 (UTC)

arXiv

