Progressive Localization Networks for Language-based Moment Localization

Qi Zheng; Jianfeng Dong; Xiaoye Qu; Xun Yang; Yabing Wang; Pan Zhou; Baolong Liu; Xun Wang

This paper targets the task of language-based video moment localization. The language-based setting of this task allows for an open set of target activities, which results in a large variation in the temporal lengths of video moments. Most existing methods first sample a sufficient number of candidate moments with various temporal lengths, and then match them against the given query to determine the target moment. However, candidate moments generated with a fixed temporal granularity may be suboptimal for handling the large variation in moment lengths. To this end, we propose a novel multi-stage Progressive Localization Network (PLN) that progressively localizes the target moment in a coarse-to-fine manner. Specifically, each stage of PLN has a localization branch and focuses on candidate moments generated with a specific temporal granularity. The temporal granularity of the candidate moments differs across stages. Moreover, we devise a conditional feature manipulation module and an upsampling connection to bridge the multiple localization branches. In this fashion, the later stages are able to absorb the previously learned information, thus facilitating more fine-grained localization. Extensive experiments on three public datasets demonstrate the effectiveness of our proposed PLN for language-based moment localization, especially for localizing short moments in long videos.
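The effect of temporal granularity on candidate moments can be illustrated with a small sketch. This is not the authors' implementation: the helper names (`candidate_moments`, `temporal_iou`, `best_iou`), the grid-based enumeration of candidates, and the clip counts and strides are all assumptions chosen for illustration, and the learned localization branches, the conditional feature manipulation module, and the upsampling connection are not modeled. The sketch only shows why a short moment in a long video is poorly covered by coarse candidates and better covered by finer ones.

```python
from typing import List, Tuple

Moment = Tuple[int, int]  # half-open span [start, end), measured in clip units


def candidate_moments(num_clips: int, stride: int) -> List[Moment]:
    """Enumerate every candidate whose boundaries lie on a grid of the given stride.

    A larger stride means a coarser temporal granularity (hypothetical scheme).
    """
    bounds = list(range(0, num_clips + 1, stride))
    return [(s, e) for i, s in enumerate(bounds) for e in bounds[i + 1:]]


def temporal_iou(a: Moment, b: Moment) -> float:
    """Temporal intersection-over-union of two half-open spans."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def best_iou(target: Moment, candidates: List[Moment]) -> float:
    """Best overlap any candidate achieves with the target moment."""
    return max(temporal_iou(target, c) for c in candidates)


if __name__ == "__main__":
    num_clips = 64        # assumed video length in clips
    target = (21, 24)     # assumed short target moment
    for stride in (16, 4, 1):  # coarse to fine
        cands = candidate_moments(num_clips, stride)
        print(f"stride={stride:2d} candidates={len(cands):4d} "
              f"best IoU={best_iou(target, cands):.4f}")
```

Under these assumptions, the coarsest grid yields few candidates that overlap the short target only loosely, while the finest grid contains an exact match at the cost of far more candidates. That trade-off is the kind of tension a coarse-to-fine, multi-stage design aims to ease.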

updated: Thu Mar 03 2022 15:07:43 GMT+0000 (UTC)

published: Tue Feb 02 2021 03:45:59 GMT+0000 (UTC)

arXiv
