Few-Shot Action Localization without Knowing Boundaries

Ting-Ting Xie; Christos Tzelepis; Fan Fu; Ioannis Patras

Learning to localize actions in long, cluttered, and untrimmed videos is a hard task that the literature has typically addressed by assuming that large amounts of annotated training samples are available for each class, either in a fully-supervised setting, where action boundaries are known, or in a weakly-supervised setting, where only class labels are known for each video. In this paper, we go a step further and show that it is possible to learn to localize actions in untrimmed videos when a) only one or a few trimmed examples of the target action are available at test time, and b) a large collection of videos with only class-label annotation (some trimmed and some weakly annotated untrimmed ones) is available for training, with no overlap between the classes used during training and testing. To do so, we propose a network that learns to estimate Temporal Similarity Matrices (TSMs) that model a fine-grained similarity pattern between pairs of videos (trimmed or untrimmed), and uses them to generate Temporal Class Activation Maps (TCAMs) for seen or unseen classes. The TCAMs serve as temporal attention mechanisms to extract video-level representations of untrimmed videos, and to temporally localize actions at test time. To the best of our knowledge, we are the first to propose a weakly-supervised, one/few-shot action localization network that can be trained in an end-to-end fashion. Experimental results on the THUMOS14 and ActivityNet1.2 datasets show that our method achieves performance comparable to or better than state-of-the-art fully-supervised, few-shot learning methods.
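To make the idea of a temporal similarity matrix more concrete, the following is a minimal illustrative sketch, not the network described in the abstract. It assumes per-snippet feature vectors are already available, substitutes plain cosine similarity for the learned TSM, and uses a max over the reference axis followed by a softmax as a stand-in for the TCAM-style temporal attention. All function names, the temperature, and the threshold are hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def temporal_similarity_matrix(query, reference):
    """Rows index query (untrimmed) snippets, columns index reference snippets.

    Assumption: cosine similarity replaces the learned similarity.
    """
    return [[cosine(q, r) for r in reference] for q in query]


def temporal_attention(tsm, temperature=0.1):
    """Keep each query snippet's best match, then softmax over time."""
    scores = [max(row) / temperature for row in tsm]
    peak = max(scores)  # subtract the peak for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def localize(attention, threshold):
    """Return (start, end) snippet index pairs, end exclusive, above threshold."""
    segments, start = [], None
    for t, a in enumerate(attention):
        if a >= threshold and start is None:
            start = t
        elif a < threshold and start is not None:
            segments.append((start, t))
            start = None
    if start is not None:
        segments.append((start, len(attention)))
    return segments


# Toy features: a 2-snippet trimmed example and a 5-snippet untrimmed video
# in which only snippets 2 and 3 resemble the example.
reference = [[1.0, 0.0], [0.9, 0.1]]
query = [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.1], [0.0, 1.0]]

tsm = temporal_similarity_matrix(query, reference)
attention = temporal_attention(tsm)
segments = localize(attention, threshold=1.0 / len(query))
print(segments)
```

On this toy input the attention mass concentrates on snippets 2 and 3, so the sketch reports the single segment `(2, 4)`. The uniform-level threshold `1 / len(query)` is just one simple way to separate foreground from background here; the abstract does not specify how localization is derived from the TCAMs.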

updated: Tue Jun 08 2021 07:32:43 GMT+0000 (UTC)

published: Tue Jun 08 2021 07:32:43 GMT+0000 (UTC)

arXiv
