D2-Net: Weakly-Supervised Action Localization via Discriminative Embeddings and Denoised Activations

Sanath Narayan; Hisham Cholakkal; Munawar Hayat; Fahad Shahbaz Khan; Ming-Hsuan Yang; Ling Shao

This work proposes a weakly-supervised temporal action localization framework, called D2-Net, which strives to temporally localize actions using video-level supervision. Our main contribution is the introduction of a novel loss formulation, which jointly enhances the discriminability of latent embeddings and robustness of the output temporal class activations with respect to foreground-background noise caused by weak supervision. The proposed formulation comprises a discriminative and a denoising loss term for enhancing temporal action localization. The discriminative term incorporates a classification loss and utilizes a top-down attention mechanism to enhance the separability of latent foreground-background embeddings. The denoising loss term explicitly addresses the foreground-background noise in class activations by simultaneously maximizing intra-video and inter-video mutual information using a bottom-up attention mechanism. As a result, activations in the foreground regions are emphasized whereas those in the background regions are suppressed, thereby leading to more robust predictions. Comprehensive experiments are performed on two benchmarks: THUMOS14 and ActivityNet1.2. Our D2-Net performs favorably in comparison to the existing methods on both datasets, achieving gains as high as 3.6% in terms of mean average precision on THUMOS14.
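The abstract gives no implementation details, so the following is only a minimal sketch of the generic weakly-supervised localization setting it describes: per-snippet class activations are pooled into a video-level prediction (the only level at which labels exist), and segments are then read off by thresholding. It is not D2-Net's loss formulation. The function names, the attention-weighted pooling, and the threshold value are all assumptions made for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def video_level_scores(tcas, attention):
    """Pool a T x C temporal class activation sequence into C video-level scores.

    tcas:      list of T rows, each a list of C class activations (hypothetical input)
    attention: list of T foreground weights in [0, 1] (hypothetical input)
    """
    num_classes = len(tcas[0])
    norm = sum(attention)
    pooled = [
        sum(a * row[c] for a, row in zip(attention, tcas)) / norm
        for c in range(num_classes)
    ]
    return softmax(pooled)


def localize(tcas, attention, cls, threshold=0.5):
    """Return [start, end) snippet ranges whose weighted activation for `cls`
    exceeds `threshold` (the threshold is an illustrative assumption)."""
    segments, start = [], None
    for t, (a, row) in enumerate(zip(attention, tcas)):
        active = a * row[cls] > threshold
        if active and start is None:
            start = t
        elif not active and start is not None:
            segments.append((start, t))
            start = None
    if start is not None:
        segments.append((start, len(tcas)))
    return segments
```

In this sketch, snippets with low attention contribute little to both the video-level score and the localized segments, which mirrors the abstract's goal of emphasizing foreground activations and suppressing background ones.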

updated: Fri Dec 11 2020 16:01:56 GMT+0000 (UTC)

published: Fri Dec 11 2020 16:01:56 GMT+0000 (UTC)

arXiv
