Revisiting the Spatial and Temporal Modeling for Few-shot Action Recognition

Jiazheng Xing; Mengmeng Wang; Yong Liu; Boyu Mu

Spatial and temporal modeling is one of the core aspects of few-shot action recognition. Most previous works focus mainly on long-term temporal relation modeling based on high-level spatial representations, without considering the crucial low-level spatial features and short-term temporal relations. In fact, the former bring rich local semantic information, and the latter represent the motion characteristics of adjacent frames. In this paper, we propose SloshNet, a new framework that revisits spatial and temporal modeling for few-shot action recognition in a finer-grained manner. First, to exploit the low-level spatial features, we design a feature fusion architecture search module that automatically searches for the best combination of low-level and high-level spatial features. Next, inspired by recent transformers, we introduce a long-term temporal modeling module to model the global temporal relations based on the extracted spatial appearance features. Meanwhile, we design a separate short-term temporal modeling module to encode the motion characteristics between adjacent frame representations. The final predictions are then obtained by feeding the embedded rich spatial-temporal features to a common frame-level class prototype matcher. We extensively validate the proposed SloshNet on four few-shot action recognition datasets: Something-Something V2, Kinetics, UCF101, and HMDB51. It achieves favorable results against state-of-the-art methods on all datasets.
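The abstract names two ideas that a toy example can make concrete: encoding motion between adjacent frame representations, and matching a query against frame-level class prototypes. The sketch below is not SloshNet and does not reproduce any of its modules. It assumes that short-term motion can be approximated by differences between consecutive frame feature vectors, and that a prototype is the frame-by-frame mean of a class's support videos, compared by squared Euclidean distance. All function names and the sample data are hypothetical.

```python
from typing import Dict, List

Vector = List[float]
Video = List[Vector]  # one feature vector per frame


def adjacent_frame_differences(video: Video) -> Video:
    """Assumed short-term motion proxy: feature difference of consecutive frames."""
    return [
        [b - a for a, b in zip(prev, nxt)]
        for prev, nxt in zip(video, video[1:])
    ]


def frame_level_prototype(videos: List[Video]) -> Video:
    """Average the support videos of one class, frame by frame."""
    n = len(videos)
    return [
        [sum(values) / n for values in zip(*frames)]
        for frames in zip(*videos)
    ]


def frame_distance(query: Video, prototype: Video) -> float:
    """Sum of squared Euclidean distances between temporally aligned frames."""
    return sum(
        (q - p) ** 2
        for q_frame, p_frame in zip(query, prototype)
        for q, p in zip(q_frame, p_frame)
    )


def classify(query: Video, support: Dict[str, List[Video]]) -> str:
    """Return the label whose motion prototype is closest to the query's motion."""
    query_motion = adjacent_frame_differences(query)
    distances = {
        label: frame_distance(
            query_motion,
            frame_level_prototype([adjacent_frame_differences(v) for v in videos]),
        )
        for label, videos in support.items()
    }
    return min(distances, key=distances.get)


# Hypothetical 1-shot, 2-way episode with 3 frames and 2-dimensional features.
support = {
    "push_right": [[[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]],
    "push_up": [[[0.0, 0.0], [0.0, 1.0], [0.0, 2.0]]],
}
query = [[5.0, 5.0], [6.0, 5.0], [7.0, 5.0]]
predicted = classify(query, support)
```

Because this toy matcher compares only frame differences, the query is assigned to `push_right` even though its absolute feature values differ from the support video. That illustrates why motion cues between adjacent frames can complement appearance features, though the paper's actual modules are learned and combine both.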

updated: Sat Apr 08 2023 03:29:05 GMT+0000 (UTC)

published: Thu Jan 19 2023 08:34:04 GMT+0000 (UTC)

arXiv
