EVE: Efficient zero-shot text-based Video Editing with Depth Map Guidance and Temporal Consistency Constraints

Yutao Chen; Xingning Dong; Tian Gan; Chunluan Zhou; Ming Yang; Qingpei Guo

Motivated by the superior performance of image diffusion models, a growing number of researchers are extending these models to the text-based video editing task. Nevertheless, current video editing methods face a dilemma between high fine-tuning cost and limited generation capacity. We conjecture that, compared with images, videos require more constraints to preserve temporal consistency during editing. To this end, we propose EVE, a robust and efficient zero-shot video editing method. Guided by depth maps and temporal consistency constraints, EVE produces satisfactory video editing results at an affordable computational and time cost. Moreover, recognizing the absence of a publicly available video editing dataset for fair comparison, we construct a new benchmark dataset, ZVE-50. Through comprehensive experiments, we validate that EVE achieves a satisfactory trade-off between performance and efficiency. We will release our dataset and codebase to facilitate future research.

updated: Mon Aug 21 2023 11:36:46 GMT+0000 (UTC)

published: Mon Aug 21 2023 11:36:46 GMT+0000 (UTC)

arXiv
