VideoXum: Cross-modal Visual and Textural Summarization of Videos

Jingyang Lin; Hang Hua; Ming Chen; Yikang Li; Jenhao Hsiao; Chiuman Ho; Jiebo Luo

Video summarization aims to distill the most important information from a source video into either an abridged clip or a textual narrative. Existing methods have traditionally been designed for one output type or the other, video or text, and thus ignore the correlation between the two semantically related tasks of visual summarization and textual summarization. We propose a new joint video and text summarization task. The goal is to generate, from a long video, both a shortened video clip and the corresponding textual summary, collectively referred to as a cross-modal summary. The generated video clip and text narrative should be semantically well aligned. To this end, we first build a large-scale human-annotated dataset, VideoXum (X refers to different modalities), by reannotating ActivityNet. After filtering out videos that do not meet the length requirements, 14,001 long videos remain in our new dataset. Each video in the reannotated dataset has human-annotated video summaries and corresponding narrative summaries. We then design a novel end-to-end model, VTSUM-BLIP, to address the challenges of the proposed task. Moreover, we propose a new metric, VT-CLIPScore, to help evaluate the semantic consistency of cross-modal summaries. The proposed model achieves promising performance on this new task and establishes a benchmark for future research.
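The abstract does not define VT-CLIPScore, so the following is only a rough sketch of what a cross-modal consistency score could look like, not the authors' metric. It assumes that each frame of the video summary and the text summary have already been embedded into a shared vector space (for example by a CLIP-style encoder), and it averages the cosine similarity between each frame embedding and the text embedding. The names `cosine` and `consistency_score` are hypothetical.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def consistency_score(frame_embeddings: Sequence[Vector],
                      text_embedding: Vector) -> float:
    """Mean frame-to-text cosine similarity (an assumed, illustrative score).

    frame_embeddings: one vector per frame kept in the video summary.
    text_embedding:   one vector for the text summary, in the same space.
    """
    if not frame_embeddings:
        raise ValueError("the video summary must contain at least one frame")
    total = sum(cosine(frame, text_embedding) for frame in frame_embeddings)
    return total / len(frame_embeddings)
```

A higher value would indicate that the selected frames and the text narrative point in similar directions in the embedding space, which is one plausible reading of "semantically well aligned".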

updated: Tue Mar 21 2023 17:51:23 GMT+0000 (UTC)

published: Tue Mar 21 2023 17:51:23 GMT+0000 (UTC)

arXiv
