Natural Language Video Localization: A Revisit in Span-based Question Answering Framework

Hao Zhang; Aixin Sun; Wei Jing; Liangli Zhen; Joey Tianyi Zhou; Rick Siow Mong Goh

Natural Language Video Localization (NLVL) aims to locate the target moment in an untrimmed video that semantically corresponds to a text query. Existing approaches mainly solve the NLVL problem from a computer vision perspective, formulating it as a ranking, anchor-based, or regression task. These methods suffer from large performance degradation when localizing in long videos. In this work, we address NLVL from a new perspective, i.e., span-based question answering (QA), by treating the input video as a text passage. We propose a video span localizing network (VSLNet), built on top of the standard span-based QA framework (named VSLBase), to address NLVL. VSLNet tackles the differences between NLVL and span-based QA through a simple yet effective query-guided highlighting (QGH) strategy. QGH guides VSLNet to search for the matching video span within a highlighted region. To address the performance degradation on long videos, we further extend VSLNet to VSLNet-L by applying a multi-scale split-and-concatenation strategy. VSLNet-L first splits the untrimmed video into short clip segments; it then predicts which clip segment contains the target moment and suppresses the importance of the other segments. Finally, the clip segments are concatenated, with different confidences, to locate the target moment accurately. Extensive experiments on three benchmark datasets show that the proposed VSLNet and VSLNet-L outperform state-of-the-art methods, and that VSLNet-L addresses the issue of performance degradation on long videos. Our study suggests that the span-based QA framework is an effective strategy for solving the NLVL problem.
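The span-based QA view described above can be illustrated with a minimal sketch of the final decoding step: given a start score and an end score for each video feature position, pick the pair with the highest combined score subject to the start not coming after the end, then map the indices back to time. This is not the authors' implementation; the function names `best_span` and `span_to_seconds`, the additive scoring, and the uniform mapping from feature index to seconds are all assumptions made for illustration.

```python
from typing import List, Tuple


def best_span(start_scores: List[float], end_scores: List[float]) -> Tuple[int, int]:
    """Return (start, end) maximizing start_scores[s] + end_scores[e] with s <= e.

    Hypothetical helper: the scores stand in for per-position outputs of a
    span predictor over video features (an assumption, not the paper's code).
    """
    if not start_scores or len(start_scores) != len(end_scores):
        raise ValueError("score lists must be non-empty and of equal length")

    best = (0, 0)
    best_score = float("-inf")
    best_start = 0  # index of the highest start score seen so far
    for e in range(len(end_scores)):
        # Only starts at or before e are eligible, which enforces s <= e.
        if start_scores[e] > start_scores[best_start]:
            best_start = e
        score = start_scores[best_start] + end_scores[e]
        if score > best_score:
            best_score = score
            best = (best_start, e)
    return best


def span_to_seconds(span: Tuple[int, int], num_features: int, duration: float) -> Tuple[float, float]:
    """Map feature indices to seconds, assuming features are evenly spaced in time."""
    s, e = span
    unit = duration / num_features
    return s * unit, (e + 1) * unit


if __name__ == "__main__":
    start = [0.1, 0.9, 0.2, 0.1]
    end = [0.8, 0.1, 0.7, 0.3]
    span = best_span(start, end)
    print(span, span_to_seconds(span, num_features=4, duration=40.0))
```

In this toy input the highest end score sits before the highest start score, so taking the two maxima independently would give an invalid span; the constrained search avoids that in a single pass.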

updated: Mon Mar 01 2021 07:58:49 GMT+0000 (UTC)

published: Fri Feb 26 2021 15:57:59 GMT+0000 (UTC)

arXiv
