Foreground-Action Consistency Network for Weakly Supervised Temporal Action Localization

Linjiang Huang; Liang Wang; Hongsheng Li

Weakly supervised temporal action localization, a challenging task in high-level video understanding, has been attracting increasing attention. With only video-level annotations available, most existing methods handle this task with a localization-by-classification framework, which generally uses a selector to pick snippets with a high probability of containing an action, namely the foreground. However, existing foreground selection strategies share a major limitation: they consider only the unilateral relation from foreground to actions, which cannot guarantee foreground-action consistency. In this paper, we present FAC-Net, a framework built on the I3D backbone with three appended branches: a class-wise foreground classification branch, a class-agnostic attention branch, and a multiple instance learning branch. First, the class-wise foreground classification branch regularizes the relation between actions and foreground to maximize foreground-background separation. In addition, the class-agnostic attention branch and the multiple instance learning branch regularize foreground-action consistency and help learn a meaningful foreground classifier. Within each branch, we introduce a hybrid attention mechanism that calculates multiple attention scores for each snippet, focusing on both discriminative and less-discriminative snippets to capture the full action boundaries. Experimental results on THUMOS14 and ActivityNet1.3 demonstrate the state-of-the-art performance of our method. Our code is available at https://github.com/LeonHLJ/FAC-Net.
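The abstract names the three branches but gives no formulas, so the following is only a rough sketch of how three such branches could each turn snippet features into video-level class scores. Everything in it is an assumption made for illustration: the function name `three_branch_scores`, the shared linear classifier, the temporal softmax used for class-wise attention, the sigmoid used for class-agnostic attention, and the top-k mean pooling in the multiple instance learning branch. It takes precomputed snippet features (for example from I3D) as a plain array and does not model the hybrid attention mechanism or any training loss. The authors' repository is the reference for the actual method.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def three_branch_scores(features, class_weights, attn_weights, k=4):
    """Hypothetical three-branch scoring (illustrative, not the paper's code).

    features:      (T, D) snippet features, assumed precomputed
    class_weights: (C, D) one linear classifier per action class (assumed shared)
    attn_weights:  (D,)   weights of a class-agnostic attention scorer (assumed)
    """
    # Snippet-level class logits, reused by every branch in this sketch.
    logits = features @ class_weights.T                   # (T, C)

    # Class-wise foreground branch (assumed form): for each class, a softmax
    # over time decides which snippets count as that class's foreground.
    cw_attn = softmax(logits, axis=0)                     # (T, C), columns sum to 1
    cw_feat = cw_attn.T @ features                        # (C, D)
    cw_scores = np.sum(cw_feat * class_weights, axis=1)   # (C,)

    # Class-agnostic attention branch (assumed form): one foreground score
    # per snippet, then attention-weighted average pooling over time.
    ca_attn = sigmoid(features @ attn_weights)            # (T,)
    ca_feat = (ca_attn[:, None] * features).sum(axis=0) / ca_attn.sum()
    ca_scores = class_weights @ ca_feat                   # (C,)

    # Multiple instance learning branch (assumed form): average the k
    # highest snippet logits of each class.
    k = min(k, logits.shape[0])
    mil_scores = np.sort(logits, axis=0)[-k:].mean(axis=0)  # (C,)

    return cw_scores, ca_scores, mil_scores, cw_attn, ca_attn


# Toy usage with random inputs: 16 snippets, 8-dim features, 3 classes.
rng = np.random.default_rng(0)
feats = rng.normal(size=(16, 8))
w_cls = rng.normal(size=(3, 8))
w_att = rng.normal(size=8)
cw, ca, mil, cw_attn, ca_attn = three_branch_scores(feats, w_cls, w_att)
```

In this toy form, the class-wise attention and the class-agnostic attention both assign per-snippet foreground weights, so comparing `cw_attn` with `ca_attn` gives a feel for what agreement between foreground and action predictions might look like. How FAC-Net actually enforces that consistency is not specified in the abstract.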

updated: Sat Aug 14 2021 12:34:44 GMT+0000 (UTC)

published: Sat Aug 14 2021 12:34:44 GMT+0000 (UTC)

arXiv
