Towards Diverse Temporal Grounding under Single Positive Labels

Hao Zhou; Chongyang Zhang; Yanjun Chen; Chuanping Hu

Temporal grounding aims to retrieve the moments of a described event within an untrimmed video, given a language query. Existing methods typically assume that annotations are precise and unique, yet in many cases one query may describe multiple moments. Hence, treating the task as a one-vs-one mapping and striving to match single-label annotations will inevitably introduce false negatives during optimization. In this study, we reformulate the task as a one-vs-many optimization problem under the condition of single positive labels. Unlabeled moments are considered unobserved rather than negative, and we explore mining potential positive moments to assist multiple-moment retrieval. In this setting, we propose a novel Diverse Temporal Grounding framework, termed DTG-SPL, which mainly consists of a positive moment estimation (PME) module and a diverse moment regression (DMR) module. PME leverages semantic reconstruction information and an expected positive regularization to uncover potential positive moments in an online fashion. Under the supervision of these pseudo positives, DMR is able to localize diverse moments in parallel to meet the needs of different users. The entire framework allows for end-to-end optimization as well as fast inference. Extensive experiments on Charades-STA and ActivityNet Captions show that our method achieves superior performance in terms of both single-label and multi-label metrics.
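
As a rough illustration of the single-positive setting described above, the sketch below scores candidate moments for one query, supervises only the annotated moment, and adds a simple penalty that keeps the predicted number of positives near an assumed expectation. It is not the DTG-SPL implementation: the function name, the squared-difference form of the regularizer, and the parameters `expected_positives` and `reg_weight` are all assumptions made for illustration, and the abstract does not specify how PME or its regularization is actually computed.

```python
import math


def single_positive_loss(probs, positive_index,
                         expected_positives=2.0, reg_weight=1.0):
    """Illustrative loss for one query with a single annotated moment.

    probs: predicted probability in (0, 1) for each candidate moment.
    positive_index: index of the one annotated (positive) moment.
    expected_positives: hypothetical prior on how many moments match a query.
    reg_weight: hypothetical weight of the regularization term.
    """
    # Supervise only the annotated moment. Unlabeled moments are treated
    # as unobserved, so no negative log-likelihood term is added for them.
    positive_term = -math.log(probs[positive_index])

    # Keep the predicted count of positives close to the assumed expectation,
    # which discourages the trivial "everything is positive" solution.
    predicted_count = sum(probs)
    reg_term = (predicted_count - expected_positives) ** 2

    return positive_term + reg_weight * reg_term


# Two candidates score highly, but only index 0 is annotated.
loss = single_positive_loss([0.9, 0.9, 0.1], positive_index=0)
```

In this sketch the second high-scoring moment is not penalized as a false negative, which is the behavior the one-vs-many reformulation is meant to allow; a one-vs-one loss would instead push its score toward zero.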

updated: Sun Mar 12 2023 02:54:18 GMT+0000 (UTC)

published: Sun Mar 12 2023 02:54:18 GMT+0000 (UTC)

arXiv
