CLIP-It! Language-Guided Video Summarization

Medhini Narasimhan; Anna Rohrbach; Trevor Darrell

A generic video summary is an abridged version of a video that conveys the whole story and features the most important scenes. Yet the importance of scenes in a video is often subjective, and users should have the option of customizing the summary by using natural language to specify what is important to them. Further, existing models for fully automatic generic summarization have not exploited available language models, which can serve as an effective prior for saliency. This work introduces CLIP-It, a single framework for addressing both generic and query-focused video summarization, typically approached separately in the literature. We propose a language-guided multimodal transformer that learns to score frames in a video based on their importance relative to one another and their correlation with a user-defined query (for query-focused summarization) or an automatically generated dense video caption (for generic video summarization). Our model can be extended to the unsupervised setting by training without ground-truth supervision. We outperform baselines and prior work by a significant margin on both standard video summarization datasets (TVSum and SumMe) and a query-focused video summarization dataset (QFVS). Particularly, we achieve large improvements in the transfer setting, attesting to our method's strong generalization capabilities.
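
The scoring idea in the abstract (frames rated by their importance relative to one another and by their correlation with a language input) can be illustrated with a toy sketch. This is not the authors' implementation: the learned multimodal transformer is replaced here by a fixed, untrained heuristic. The function names `score_frames` and `select_summary` are hypothetical, and the frame and text embeddings are assumed to be precomputed in a shared space (for example by an image-text encoder).

```python
import numpy as np


def _softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def _l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def score_frames(frame_feats, text_feat):
    """Toy language-guided frame scoring (hypothetical, untrained).

    frame_feats: (T, D) array, one embedding per frame.
    text_feat:   (D,) array, embedding of the query or caption.
    Returns a (T,) array of scores in (0, 1).
    """
    f = _l2_normalize(np.asarray(frame_feats, dtype=float))
    t = _l2_normalize(np.asarray(text_feat, dtype=float))

    # Correlation with the language input: cosine similarity per frame.
    text_sim = f @ t

    # Relation to the other frames: one self-attention pass mixes each
    # frame with the frames it resembles, giving a contextual embedding.
    attn = _softmax(f @ f.T / np.sqrt(f.shape[1]), axis=-1)
    context = _l2_normalize(attn @ f)
    context_sim = context @ t

    # Assumption: the two cues are simply summed and squashed; a trained
    # model would learn this combination instead.
    logits = text_sim + context_sim
    return 1.0 / (1.0 + np.exp(-logits))


def select_summary(scores, k):
    """Return the indices of the k highest-scoring frames in temporal order."""
    top = np.argsort(scores)[-k:]
    return np.sort(top)


# Usage with made-up 3-dimensional embeddings (illustrative values only).
frames = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
])
query = np.array([1.0, 0.0, 0.0])
scores = score_frames(frames, query)
summary = select_summary(scores, k=2)
```

In this sketch the same function covers both settings described in the abstract: `text_feat` would hold the embedding of a user query in the query-focused case, or of a generated caption in the generic case.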

updated: Wed Dec 08 2021 01:30:47 GMT+0000 (UTC)

published: Thu Jul 01 2021 17:59:27 GMT+0000 (UTC)

arXiv
