Hierarchical Video-Moment Retrieval and Step-Captioning

Abhay Zala; Jaemin Cho; Satwik Kottur; Xilun Chen; Barlas Oğuz; Yashar Mehdad; Mohit Bansal

There is growing interest in searching for information in large video corpora. Prior work has studied related tasks, such as text-based video retrieval, moment retrieval, video summarization, and video captioning, in isolation, without an end-to-end setup that can jointly search a video corpus and generate summaries. Such an end-to-end setup would enable many interesting applications, e.g., a text-based search that finds a relevant video in a video corpus, extracts the most relevant moment from that video, and segments the moment into important steps with captions. To address this, we present the HiREST (HIerarchical REtrieval and STep-captioning) dataset and propose a new benchmark that covers hierarchical information retrieval and visual/textual stepwise summarization from an instructional video corpus. HiREST consists of 3.4K text-video pairs from an instructional video dataset, where 1.1K videos are annotated with the moment span relevant to the text query and a breakdown of each moment into key instruction steps with captions and timestamps (8.6K step captions in total). Our hierarchical benchmark consists of video retrieval, moment retrieval, and two novel tasks: moment segmentation and step captioning. In moment segmentation, models break down a video moment into instruction steps and identify their start and end boundaries. In step captioning, models generate a textual summary for each step. We also present task-specific and end-to-end joint baseline models as starting points for our new benchmark. While the baseline models show some promising results, there is still substantial room for future improvement by the community. Project website: https://hirest-cvpr2023.github.io
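The hierarchy described in the abstract (text query, retrieved video, moment span, captioned steps) can be pictured as a nested record. The following is only an illustrative sketch: the class names, field names, example values, and the assumption that steps are ordered and non-overlapping within the moment are hypothetical and are not taken from the HiREST release.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    """One instruction step inside a moment (hypothetical schema)."""
    start: float  # step start time in seconds
    end: float    # step end time in seconds
    caption: str  # short textual summary of the step


@dataclass
class MomentAnnotation:
    """A text query paired with a video, a moment span, and its steps (hypothetical schema)."""
    query: str
    video_id: str
    start: float  # moment start time in seconds
    end: float    # moment end time in seconds
    steps: List[Step] = field(default_factory=list)


def steps_are_consistent(moment: MomentAnnotation) -> bool:
    """Return True if steps are ordered, non-overlapping, and lie inside the moment span.

    The non-overlap requirement is an assumption made for this sketch.
    """
    previous_end = moment.start
    for step in moment.steps:
        if step.start < previous_end or step.end <= step.start or step.end > moment.end:
            return False
        previous_end = step.end
    return True


# Made-up example values, for illustration only.
example = MomentAnnotation(
    query="how to make a paper airplane",
    video_id="hypothetical_video_id",
    start=10.0,
    end=40.0,
    steps=[
        Step(10.0, 18.0, "fold the paper in half"),
        Step(18.0, 30.0, "fold the corners to the center"),
        Step(30.0, 40.0, "fold the wings down"),
    ],
)
```

In this picture, video retrieval selects the `video_id` for a query, moment retrieval predicts the moment's `start` and `end`, moment segmentation predicts the step boundaries, and step captioning fills in each `caption`.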

updated: Wed Mar 29 2023 02:33:54 GMT+0000 (UTC)

published: Wed Mar 29 2023 02:33:54 GMT+0000 (UTC)

arXiv
