Transcript to Video: Efficient Clip Sequencing from Texts

Yu Xiong; Fabian Caba Heilbron; Dahua Lin

Among the numerous videos shared on the web, well-edited ones always attract more attention. However, it is difficult for inexperienced users to make well-edited videos because doing so requires professional expertise and immense manual labor. To meet the needs of non-experts, we present Transcript-to-Video, a weakly-supervised framework that takes text as input and automatically creates video sequences from an extensive collection of shots. Specifically, we propose a Content Retrieval Module and a Temporal Coherent Module to learn visual-language representations and to model shot sequencing styles, respectively. For fast inference, we introduce an efficient search strategy for real-time video clip sequencing. Quantitative results and user studies demonstrate empirically that the proposed learning framework can retrieve content-relevant shots while creating video sequences that are plausible in terms of style. In addition, the run-time performance analysis shows that our framework can support real-world applications.

updated: Sun Jul 25 2021 17:24:50 GMT+0000 (UTC)

published: Sun Jul 25 2021 17:24:50 GMT+0000 (UTC)

arXiv
