Background-Click Supervision for Temporal Action Localization

Le Yang; Junwei Han; Tao Zhao; Tianwei Lin; Dingwen Zhang; Jianxin Chen

Weakly supervised temporal action localization aims to learn instance-level action patterns from video-level labels, where a significant challenge is action-context confusion. To overcome this challenge, one recent work builds an action-click supervision framework. It requires similar annotation costs but steadily improves localization performance compared with conventional weakly supervised methods. In this paper, by revealing that the performance bottleneck of existing approaches mainly comes from background errors, we find that a stronger action localizer can be trained with labels on background video frames rather than on action frames. To this end, we convert action-click supervision to background-click supervision and develop a novel method called BackTAL. Specifically, BackTAL implements two-fold modeling on the background video frames, i.e., position modeling and feature modeling. In position modeling, we not only conduct supervised learning on the annotated video frames but also design a score separation module to enlarge the score differences between potential action frames and backgrounds. In feature modeling, we propose an affinity module that measures frame-specific similarities among neighboring frames and dynamically attends to informative neighbors when calculating temporal convolution. Extensive experiments on three benchmarks demonstrate the high performance of BackTAL and the rationality of the proposed background-click supervision. Code is available at https://github.com/VividLe/BackTAL.
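
To make the feature-modeling idea more concrete, the following is a minimal sketch of a temporal convolution whose neighbor contributions are scaled by a frame-specific affinity. It is not taken from the BackTAL code. The function names, the choice of cosine similarity clipped at zero as the affinity, the kernel size of 3, and the scalar kernel weights are all assumptions made for illustration; the abstract does not specify how the affinity is computed or applied.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def affinity_temporal_conv(features, kernel):
    """Kernel-size-3 temporal convolution with affinity-scaled taps.

    features: list of T per-frame feature vectors (lists of floats).
    kernel:   three scalar weights [w_prev, w_center, w_next]. These are
              hypothetical; a real layer would use learned weight matrices.
    """
    num_frames = len(features)
    dim = len(features[0])
    output = []
    for t in range(num_frames):
        acc = [0.0] * dim
        for offset, weight in zip((-1, 0, 1), kernel):
            s = t + offset
            if s < 0 or s >= num_frames:
                continue  # zero padding at the sequence boundaries
            # Assumed affinity: cosine similarity between frame t and its
            # neighbor s, with negative similarities clipped to zero.
            affinity = max(0.0, cosine(features[t], features[s]))
            for d in range(dim):
                acc[d] += weight * affinity * features[s][d]
        output.append(acc)
    return output


if __name__ == "__main__":
    frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    print(affinity_temporal_conv(frames, [1.0, 1.0, 1.0]))
```

In this toy input the third frame is orthogonal to the second, so neither contributes to the other's output. A plain convolution would mix them regardless of how dissimilar they are.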

updated: Wed Nov 24 2021 12:02:52 GMT+0000 (UTC)

published: Wed Nov 24 2021 12:02:52 GMT+0000 (UTC)

arXiv
