Boundary Proposal Network for Two-Stage Natural Language Video Localization

Shaoning Xiao; Long Chen; Songyang Zhang; Wei Ji; Jian Shao; Lu Ye; Jun Xiao

We address the problem of Natural Language Video Localization (NLVL): localizing the video segment that corresponds to a natural language description in a long, untrimmed video. State-of-the-art NLVL methods are almost all one-stage and typically fall into two categories: 1) anchor-based approaches, which first pre-define a series of candidate video segments (e.g., by sliding window) and then classify each candidate; and 2) anchor-free approaches, which directly predict, for each video frame, the probability that it is a boundary or an intermediate frame inside the positive segment. However, both kinds of one-stage approach have inherent drawbacks. The anchor-based approach is sensitive to its heuristic rules, which limits its ability to handle videos of varying length, while the anchor-free approach fails to exploit segment-level interaction and thus achieves inferior results. In this paper, we propose a novel Boundary Proposal Network (BPNet), a universal two-stage framework that avoids the issues mentioned above. Specifically, in the first stage, BPNet uses an anchor-free model to generate a group of high-quality candidate video segments with their boundaries. In the second stage, a visual-language fusion layer jointly models the multi-modal interaction between each candidate and the language query, followed by a matching score rating layer that outputs an alignment score for each candidate. We evaluate BPNet on three challenging NLVL benchmarks (Charades-STA, TACoS and ActivityNet-Captions). Extensive experiments and ablation studies on these datasets demonstrate that BPNet outperforms the state-of-the-art methods.
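
The two-stage idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the choice to score a candidate by the product of its start and end probabilities, and the `match_score` callable (a stand-in for the fusion and matching score layers) are all assumptions made for illustration.

```python
from itertools import product


def propose_segments(start_probs, end_probs, top_k=3):
    """Stage 1 (sketch): turn per-frame boundary probabilities into candidates.

    Each candidate is (start_index, end_index, score), where the score is
    assumed here to be start_probs[start] * end_probs[end].
    """
    candidates = [
        (s, e, start_probs[s] * end_probs[e])
        for s, e in product(range(len(start_probs)), range(len(end_probs)))
        if s <= e  # a segment cannot end before it starts
    ]
    # Keep the top_k highest-scoring candidates.
    candidates.sort(key=lambda c: c[2], reverse=True)
    return candidates[:top_k]


def rerank(candidates, match_score):
    """Stage 2 (sketch): pick the candidate with the best alignment score.

    `match_score(start, end)` is a hypothetical callable that would wrap the
    segment-level visual-language matching model.
    """
    return max(candidates, key=lambda c: match_score(c[0], c[1]))
```

In this sketch, stage 1 only narrows the search space; the final decision is left to the segment-level scorer in stage 2, which is the role the abstract assigns to the matching score rating layer.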

updated: Mon Mar 15 2021 03:06:18 GMT+0000 (UTC)

published: Mon Mar 15 2021 03:06:18 GMT+0000 (UTC)

arXiv