Make-A-Protagonist: Generic Video Editing with An Ensemble of Experts

Yuyang Zhao; Enze Xie; Lanqing Hong; Zhenguo Li; Gim Hee Lee

Text-driven image and video diffusion models have achieved unprecedented success in generating realistic and diverse content. Recently, editing and producing variations of existing images and videos with diffusion-based generative models has garnered significant attention. However, previous works are limited to editing content with text or to providing coarse personalization from a single visual clue, which makes them unsuitable for indescribable content that requires fine-grained, detailed control. To address this, we propose a generic video editing framework called Make-A-Protagonist, which uses textual and visual clues to edit videos, with the goal of empowering individuals to become the protagonists. Specifically, we leverage multiple experts to parse the source video and the target visual and textual clues, and we propose a visual-textual-based video generation model that employs mask-guided denoising sampling to generate the desired output. Extensive results demonstrate the versatile and remarkable editing capabilities of Make-A-Protagonist.
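
The abstract names mask-guided denoising sampling but gives no details, so the following is only a generic sketch of the idea of blending latents under a mask, not the paper's algorithm. It assumes a binary mask marking the region to edit and a per-step record of source latents (for example from an inversion pass). The names `blend_with_mask`, `mask_guided_sampling`, `source_latents`, and `denoise_step` are hypothetical and chosen for illustration.

```python
def blend_with_mask(edited, source, mask):
    """Keep `edited` where mask is 1 and `source` where mask is 0."""
    if not (len(edited) == len(source) == len(mask)):
        raise ValueError("edited, source and mask must have the same length")
    return [m * e + (1.0 - m) * s for e, s, m in zip(edited, source, mask)]


def mask_guided_sampling(source_latents, mask, denoise_step, num_steps):
    """Toy mask-guided sampling loop over flat lists of floats.

    Assumptions (not taken from the paper):
    - source_latents[t] is the source latent at step t, where step
      num_steps is the noisiest and step 0 is the clean latent.
    - denoise_step(x, t) maps a latent at step t to one at step t - 1.
    """
    x = list(source_latents[num_steps])
    for t in range(num_steps, 0, -1):
        x = denoise_step(x, t)
        # Outside the mask, reset to the source latent for this step so
        # that unedited regions follow the source trajectory.
        x = blend_with_mask(x, source_latents[t - 1], mask)
    return x
```

With a real model, `denoise_step` would wrap the text- and image-conditioned denoiser and the latents would be tensors with a spatial mask per frame; the list-based version here only illustrates the control flow.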

updated: Mon Feb 19 2024 02:42:27 GMT+0000 (UTC)

published: Mon May 15 2023 17:59:03 GMT+0000 (UTC)

arXiv
