Scanning Only Once: An End-to-end Framework for Fast Temporal Grounding in Long Videos

Yulin Pan; Xiangteng He; Biao Gong; Yiliang Lv; Yujun Shen; Yuxin Peng; Deli Zhao

Video temporal grounding aims to pinpoint the video segment that matches a query description. Despite recent advances on short-form videos (e.g., minutes long), temporal grounding in long videos (e.g., hours long) is still at an early stage. A common practice for addressing this challenge is to employ a sliding window, yet this can be inefficient and inflexible due to the limited number of frames within the window. In this work, we propose an end-to-end framework for fast temporal grounding that can model an hours-long video with a single network execution. Our pipeline is formulated in a coarse-to-fine manner: we first extract context knowledge from non-overlapping video clips (i.e., anchors), and then supplement the anchors that respond strongly to the query with detailed content knowledge. Besides the remarkably high pipeline efficiency, another advantage of our approach is its ability to capture long-range temporal correlation by modeling the entire video as a whole, which facilitates more accurate grounding. Experimental results suggest that, on the long-form video datasets MAD and Ego4d, our method significantly outperforms state-of-the-art methods and achieves 14.6× / 102.8× higher efficiency, respectively. The code will be released at https://github.com/afcedf/SOONet.git
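The coarse-to-fine idea in the abstract can be illustrated with a toy sketch. This is not the SOONet implementation: the function names (`coarse_to_fine_ground`, `mean_pool`, `cosine`), the mean-pooled anchor features, the cosine scoring, and the `anchor_len` and `top_k` parameters are all assumptions made for illustration. The sketch only shows the general pattern of scoring non-overlapping anchors once and then examining frames inside the highest-scoring anchors.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mean_pool(frames):
    """Average a non-empty list of frame feature vectors into one vector."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]

def coarse_to_fine_ground(frame_feats, query_feat, anchor_len=4, top_k=2):
    """Hypothetical coarse-to-fine search; returns (best_frame_index, score)."""
    # Coarse stage: split the video into non-overlapping anchors and
    # score each pooled anchor against the query in a single pass.
    anchors = []
    for start in range(0, len(frame_feats), anchor_len):
        clip = frame_feats[start:start + anchor_len]
        anchors.append((start, cosine(mean_pool(clip), query_feat)))

    # Keep only the anchors that respond most strongly to the query.
    best_anchors = sorted(anchors, key=lambda a: a[1], reverse=True)[:top_k]

    # Fine stage: look at individual frames, but only inside the kept anchors.
    candidates = []
    for start, _ in best_anchors:
        end = min(start + anchor_len, len(frame_feats))
        for i in range(start, end):
            candidates.append((i, cosine(frame_feats[i], query_feat)))
    return max(candidates, key=lambda c: c[1])

# Toy data (made up): 8 frames with 2-d features and a query along the first axis.
toy_frames = [[0.0, 1.0]] * 4 + [[1.0, 1.0], [1.0, 0.1], [1.0, 1.0], [0.0, 1.0]]
toy_query = [1.0, 0.0]
best_frame, best_score = coarse_to_fine_ground(toy_frames, toy_query, anchor_len=4, top_k=1)
```

In this toy setting the fine stage touches only `top_k * anchor_len` frames rather than every frame, which is the efficiency intuition behind scoring anchors first; the actual model described in the abstract learns these representations end to end rather than using fixed pooling and cosine similarity.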

updated: Wed Mar 15 2023 03:54:43 GMT+0000 (UTC)

published: Wed Mar 15 2023 03:54:43 GMT+0000 (UTC)

arXiv

