GEB+: A Benchmark for Generic Event Boundary Captioning, Grounding and Retrieval

Yuxuan Wang; Difei Gao; Licheng Yu; Stan Weixian Lei; Matt Feiszli; Mike Zheng Shou

Cognitive science has shown that humans perceive videos in terms of events separated by the state changes of dominant subjects. State changes trigger new events and are among the most useful signals in the large amount of redundant information perceived. However, previous research focuses on the overall understanding of segments without evaluating the fine-grained status changes inside them. In this paper, we introduce a new dataset called Kinetic-GEB+. The dataset consists of over 170K boundaries associated with captions describing status changes in the generic events in 12K videos. Based on this new dataset, we propose three tasks (boundary captioning, grounding, and retrieval) supporting the development of a more fine-grained, robust, and human-like understanding of videos through status changes. We evaluate many representative baselines on our dataset, and we also design a new TPD (Temporal-based Pairwise Difference) Modeling method for visual difference that achieves significant performance improvements. The results also show that current methods still face formidable challenges in utilizing different granularities, representing visual difference, and accurately localizing status changes. Further analysis shows that our dataset can drive the development of more powerful methods to understand status changes and thus improve video-level comprehension. The dataset is available at https://github.com/showlab/GEB-Plus

updated: Wed Aug 10 2022 15:33:03 GMT+0000 (UTC)

published: Fri Apr 01 2022 14:45:30 GMT+0000 (UTC)

arXiv
