Edit-A-Video: Single Video Editing with Object-Aware Consistency

Chaehun Shin; Heeseung Kim; Che Hyun Lee; Sang-gil Lee; Sungroh Yoon

Although text-to-video (TTV) models have recently achieved remarkable success, few approaches extend TTV to video editing. Motivated by TTV models that adapt diffusion-based text-to-image (TTI) models, we propose a video editing framework that requires only a pretrained TTI model and a single text-video pair, which we term Edit-A-Video. The framework consists of two stages: (1) inflating the 2D model into a 3D model by appending temporal modules and tuning on the source video, and (2) inverting the source video into noise and editing it with a target text prompt and attention map injection. Each stage enables temporal modeling and preservation of the semantic attributes of the source video. One of the key challenges in video editing is background inconsistency, where regions not targeted by the edit suffer from undesirable and temporally inconsistent alterations. To mitigate this issue, we also introduce a novel mask blending method, termed sparse-causal blending (SC Blending). We improve previous mask blending methods to reflect temporal consistency, so that the edited area exhibits smooth transitions while the unedited regions remain spatio-temporally consistent. We present extensive experimental results over various types of text and videos, and demonstrate the superiority of the proposed method over baselines in terms of background consistency, text alignment, and video editing quality.
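The idea of temporally aware mask blending can be illustrated with a small sketch. The abstract does not specify how SC Blending builds its masks, so the following rests on an assumption: that each frame's blending mask is the union of its own edit mask with those of the first and the previous frame, by analogy with sparse-causal attention. The function names `sparse_causal_masks` and `blend` are hypothetical, and the sketch operates on plain NumPy arrays rather than on diffusion latents.

```python
import numpy as np

def sparse_causal_masks(masks):
    """Assumed rule (not confirmed by the abstract): the mask for frame i
    is the union of the masks of frame 0, frame i-1, and frame i.

    masks: boolean array of shape (T, H, W), True where the edit applies.
    """
    masks = np.asarray(masks, dtype=bool)
    # Mask of the previous frame; frame 0 uses itself as its "previous" frame.
    prev = np.concatenate([masks[:1], masks[:-1]], axis=0)
    # Mask of the first frame, repeated for every frame.
    first = np.broadcast_to(masks[:1], masks.shape)
    return masks | prev | first

def blend(edited, source, masks):
    """Keep edited content inside the mask and source content outside it.

    edited, source: arrays of shape (T, H, W, C).
    masks: boolean array of shape (T, H, W).
    """
    m = sparse_causal_masks(masks)[..., None]  # broadcast over channels
    return np.where(m, edited, source)
```

Under this assumed rule, pixels outside the union mask are copied from the source in every frame, which is one simple way to keep unedited regions consistent over time.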

updated: Fri Nov 17 2023 12:43:46 GMT+0000 (UTC)

published: Tue Mar 14 2023 14:35:59 GMT+0000 (UTC)

arXiv
