Motion and Context-Aware Audio-Visual Conditioned Video Prediction

Yating Xu; Gim Hee Lee

The existing state-of-the-art method for audio-visual conditioned video prediction uses the latent codes of the audio-visual frames from a multimodal stochastic network and a frame encoder to predict the next visual frame. However, directly inferring the per-pixel intensity of the next visual frame from the latent codes is extremely challenging because of the high-dimensional image space. To address this, we propose to decouple audio-visual conditioned video prediction into motion and appearance modeling. The first part is the multimodal motion estimation module, which learns motion information as optical flow from the given audio-visual clip. The second part is the context-aware refinement module, which uses the predicted optical flow to warp the current visual frame into the next visual frame and refines it based on the given audio-visual context. Experimental results show that our method achieves competitive results on existing benchmarks.
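
The warping step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `warp_frame`, the backward-warping convention, the bilinear sampling, and the border clamping are all assumptions made for illustration. In the described method the flow would come from the learned motion estimation module and the warped frame would then be refined by a learned module, neither of which is shown here.

```python
import numpy as np


def warp_frame(frame, flow):
    """Backward-warp `frame` (H, W, C) with a dense flow field `flow` (H, W, 2).

    Assumed convention: flow[y, x] = (dx, dy) means that output pixel (x, y)
    is sampled from input location (x + dx, y + dy). Sampling is bilinear and
    source coordinates are clamped to the image border.
    """
    h, w = frame.shape[:2]
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")

    # Source coordinates for every output pixel, clamped to the image.
    src_x = np.clip(xs + flow[..., 0], 0, w - 1)
    src_y = np.clip(ys + flow[..., 1], 0, h - 1)

    # Integer corners surrounding each source coordinate.
    x0 = np.floor(src_x).astype(int)
    y0 = np.floor(src_y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)

    # Fractional offsets, broadcast over the channel axis.
    wx = (src_x - x0)[..., None]
    wy = (src_y - y0)[..., None]

    # Bilinear blend of the four neighbouring pixels.
    top = frame[y0, x0] * (1 - wx) + frame[y0, x1] * wx
    bottom = frame[y1, x0] * (1 - wx) + frame[y1, x1] * wx
    return top * (1 - wy) + bottom * wy


# Toy example: one bright pixel, and a flow that samples one pixel to the left,
# so the content appears shifted one pixel to the right.
frame = np.zeros((4, 4, 1))
frame[1, 1, 0] = 1.0
flow = np.zeros((4, 4, 2))
flow[..., 0] = -1.0
warped = warp_frame(frame, flow)
```

Warping can only move existing pixels, so regions that become newly visible or change appearance cannot be recovered from the flow alone; this is the gap a refinement stage conditioned on audio-visual context is meant to fill.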

updated: Sun Apr 23 2023 03:58:57 GMT+0000 (UTC)

published: Fri Dec 09 2022 05:57:46 GMT+0000 (UTC)

arXiv
