STAF: A Spatio-Temporal Attention Fusion Network for Few-shot Video Classification

Rex Liu; Huanle Zhang; Hamed Pirsiavash; Xin Liu

We propose STAF, a Spatio-Temporal Attention Fusion network for few-shot video classification. STAF first extracts coarse-grained spatial and temporal features of videos using a 3D convolutional neural network (CNN) as its embedding network. It then fine-tunes the extracted features using self-attention and cross-attention networks. Finally, STAF applies a lightweight fusion network and a nearest neighbor classifier to classify each query video. To evaluate STAF, we conduct extensive experiments on three benchmarks (UCF101, HMDB51, and Something-Something-V2). The experimental results show that STAF improves on state-of-the-art accuracy by a large margin. For example, STAF increases the five-way one-shot accuracy by 5.3% and 7.0% on UCF101 and HMDB51, respectively.
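The final classification step described above can be illustrated with a minimal sketch of nearest neighbor classification in a five-way one-shot episode. The abstract does not specify the distance metric or the feature dimensions, so cosine similarity, the toy 5-dimensional vectors, the class labels, and the function names `cosine_similarity` and `nearest_neighbor` are all assumptions made for illustration. The vectors stand in for the fused features that the attention and fusion networks would produce; none of those networks is implemented here.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)  # epsilon guards against zero vectors


def nearest_neighbor(query, support):
    """Return the label of the support vector most similar to the query.

    support maps each class label to a list of feature vectors
    (one vector per class in the one-shot setting).
    """
    best_label, best_score = None, -math.inf
    for label, vectors in support.items():
        for vec in vectors:
            score = cosine_similarity(query, vec)
            if score > best_score:
                best_label, best_score = label, score
    return best_label


# Toy five-way one-shot episode: five classes, one support video each.
# Labels and values are hypothetical placeholders, not real model outputs.
support = {
    "archery": [[1.0, 0.0, 0.0, 0.0, 0.0]],
    "bowling": [[0.0, 1.0, 0.0, 0.0, 0.0]],
    "diving": [[0.0, 0.0, 1.0, 0.0, 0.0]],
    "fencing": [[0.0, 0.0, 0.0, 1.0, 0.0]],
    "surfing": [[0.0, 0.0, 0.0, 0.0, 1.0]],
}
query = [0.1, 0.2, 0.9, 0.0, 0.1]

print(nearest_neighbor(query, support))  # prints "diving"
```

"Five-way one-shot" means each episode presents five candidate classes with a single labeled support video per class, so the classifier has exactly five reference vectors to compare each query against.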

updated: Wed Dec 08 2021 20:41:40 GMT+0000 (UTC)

published: Wed Dec 08 2021 20:41:40 GMT+0000 (UTC)

arXiv
