Active Learning with Effective Scoring Functions for Semi-Supervised Temporal Action Localization

Ding Li; Xuebing Yang; Yongqiang Tang; Chenyang Zhang; Wensheng Zhang

Temporal Action Localization (TAL) aims to predict both the category and the temporal boundaries (start and end times) of action instances in untrimmed videos. Most existing work adopts fully supervised solutions, which have proven effective. A practical bottleneck of these solutions is the large amount of labeled training data they require. To reduce the cost of human labeling, this paper focuses on a rarely investigated yet practical task, semi-supervised TAL, and proposes an effective active learning method named AL-STAL. AL-STAL uses four steps, Train, Query, Annotate, and Append, to actively select highly informative video samples and train the localization model. It is equipped with two scoring functions that account for the uncertainty of the localization model and are used to rank and select video samples. The first, Temporal Proposal Entropy (TPE), uses the entropy of the predicted label distribution as a measure of uncertainty. The second, Temporal Context Inconsistency (TCI), introduces a new metric based on the mutual information between adjacent action proposals to evaluate the informativeness of video samples. To validate the proposed method, we conduct extensive experiments on two benchmark datasets, THUMOS'14 and ActivityNet 1.3. The experimental results show that AL-STAL outperforms existing competitors and achieves satisfying performance compared with fully supervised learning.
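The entropy-based scoring and the Query step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`proposal_entropy`, `video_score`, `query`), the use of the mean to aggregate proposal entropies into a video-level score, and the top-k selection rule are all assumptions made for illustration. The abstract only states that the entropy of the predicted label distribution is used as an uncertainty measure.

```python
import numpy as np


def proposal_entropy(probs, eps=1e-12):
    """Shannon entropy of each proposal's predicted class distribution.

    probs: array-like of shape (num_proposals, num_classes); each row is
    assumed to be a probability distribution that sums to 1.
    """
    probs = np.asarray(probs, dtype=float)
    # eps avoids log(0) for classes predicted with zero probability.
    return -np.sum(probs * np.log(probs + eps), axis=1)


def video_score(probs):
    """Hypothetical video-level uncertainty: mean entropy over proposals.

    The aggregation rule is an assumption, not taken from the paper.
    """
    return float(proposal_entropy(probs).mean())


def query(unlabeled, budget):
    """Return the ids of the `budget` most uncertain unlabeled videos.

    unlabeled: dict mapping a video id to its (num_proposals, num_classes)
    array of predicted class probabilities.
    """
    ranked = sorted(
        unlabeled,
        key=lambda vid: video_score(unlabeled[vid]),
        reverse=True,
    )
    return ranked[:budget]


# Toy pool: one video with confident predictions, one with uncertain ones.
pool = {
    "confident_video": [[0.98, 0.01, 0.01], [0.97, 0.02, 0.01]],
    "uncertain_video": [[1 / 3, 1 / 3, 1 / 3], [0.4, 0.3, 0.3]],
}
selected = query(pool, budget=1)
```

In a full active learning round, the videos returned by `query` would then be annotated and appended to the labeled set before the model is trained again. The TCI score is not sketched here because the abstract does not give its formula.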

updated: Wed Aug 31 2022 13:39:38 GMT+0000 (UTC)

published: Wed Aug 31 2022 13:39:38 GMT+0000 (UTC)

arXiv
