Task-adaptive Spatial-Temporal Video Sampler for Few-shot Action Recognition

Huabin Liu; Weixian Lv; John See; Weiyao Lin

A primary challenge in few-shot action recognition is the lack of sufficient video data for training. To address this issue, current methods mainly focus on devising algorithms at the feature level, while little attention is paid to processing the input video data. Moreover, existing frame sampling strategies may omit critical action information in the temporal and spatial dimensions, which further reduces video utilization efficiency. In this paper, we propose a novel video frame sampler for few-shot action recognition, in which task-specific spatial-temporal frame sampling is achieved via a temporal selector (TS) and a spatial amplifier (SA). Specifically, our sampler first scans the whole video at a small computational cost to obtain a global perception of the video frames. The TS then selects the top-T frames that contribute most significantly. Subsequently, the SA emphasizes the discriminative information of each frame by amplifying critical regions under the guidance of saliency maps. We further adopt task-adaptive learning to dynamically adjust the sampling strategy according to the episode task at hand. The implementations of both TS and SA are differentiable for end-to-end optimization, facilitating seamless integration of our proposed sampler with most few-shot action recognition methods. Extensive experiments show a significant performance boost on various benchmarks, including long-term videos. The code is available at https://github.com/R00Kie-Liu/Sampler
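To make the two sampling steps concrete, here is a minimal Python sketch of the general idea, not the authors' implementation. The function names `select_top_t` and `amplify_axis`, the per-frame scores, and the one-dimensional saliency profile are all hypothetical. The hard top-T selection and the inverse-transform resampling shown here are simple non-differentiable stand-ins, whereas the abstract states that the actual TS and SA are differentiable. The abstract also does not specify the amplification mechanism, so the resampling scheme is an assumption.

```python
from bisect import bisect_left
from itertools import accumulate


def select_top_t(scores, t):
    """Return the indices of the t highest-scoring frames, in temporal order.

    `scores` is a hypothetical list of per-frame importance values.
    """
    if not 0 < t <= len(scores):
        raise ValueError("t must be between 1 and the number of frames")
    # Rank frame indices by descending score; ties go to the earlier frame.
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    # Re-sort the kept indices so the selected frames stay in time order.
    return sorted(ranked[:t])


def amplify_axis(saliency, out_size):
    """Map out_size output positions to source indices along one axis.

    Assumed mechanism: inverse-transform sampling. Output positions are
    spaced uniformly in the cumulative-saliency domain, so salient source
    positions are repeated (magnified) and others are skipped (compressed).
    """
    total = sum(saliency)
    if total <= 0:
        raise ValueError("saliency must have positive total mass")
    cdf = [c / total for c in accumulate(saliency)]
    # Probe the CDF at the midpoint of each equal-probability bin.
    targets = [(k + 0.5) / out_size for k in range(out_size)]
    return [min(bisect_left(cdf, u), len(saliency) - 1) for u in targets]
```

With scores `[0.1, 0.9, 0.3, 0.8, 0.2]` and `t = 3`, `select_top_t` keeps frames `[1, 2, 3]`. With the saliency profile `[1, 1, 6, 1, 1]` and `out_size = 4`, `amplify_axis` returns `[1, 2, 2, 3]`, so the salient middle position is sampled twice while the two edge positions are dropped. A uniform profile leaves the axis unchanged.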

updated: Thu Dec 22 2022 08:41:53 GMT+0000 (UTC)

published: Wed Jul 20 2022 09:04:12 GMT+0000 (UTC)

arXiv
