Gate-Shift-Fuse for Video Action Recognition

Swathikiran Sudhakaran; Sergio Escalera; Oswald Lanz

Convolutional Neural Networks are the de facto models for image recognition. However, 3D CNNs, the straightforward extension of 2D CNNs to video recognition, have not achieved the same success on standard action recognition benchmarks. One of the main reasons for this reduced performance is their increased computational complexity, which requires large-scale annotated datasets to train them at scale. 3D kernel factorization approaches have been proposed to reduce the complexity of 3D CNNs, but existing approaches follow hand-designed and hard-wired techniques. In this paper we propose Gate-Shift-Fuse (GSF), a novel spatio-temporal feature extraction module that controls interactions in spatio-temporal decomposition and learns to adaptively route features through time and combine them in a data-dependent manner. GSF leverages grouped spatial gating to decompose the input tensor and channel weighting to fuse the decomposed tensors. GSF can be inserted into existing 2D CNNs to convert them into efficient and high-performing spatio-temporal feature extractors, with negligible parameter and compute overhead. We perform an extensive analysis of GSF using two popular 2D CNN families and achieve state-of-the-art or competitive performance on five standard action recognition benchmarks. Code and models will be made publicly available at https://github.com/swathikirans/GSF.
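To make the gate, shift, and fuse steps named in the abstract more concrete, here is a minimal NumPy sketch. It is not the authors' implementation. The function name `gate_shift_fuse`, the parameters `gate_w` and `fuse_scale`, the two-group split, the per-pixel linear gate, and the pooled-difference fusion rule are all assumptions chosen for illustration. The abstract only states that grouped spatial gating decomposes the input and channel weighting fuses the result.

```python
import numpy as np

def gate_shift_fuse(x, gate_w, fuse_scale):
    """Illustrative gate-shift-fuse step (hypothetical, not the paper's code).

    x          : features of shape (T, C, H, W), C even
    gate_w     : assumed gating weights of shape (2, C // 2), one row per group
    fuse_scale : assumed scalar controlling the data-dependent channel weights
    """
    T, C, H, W = x.shape
    half = C // 2
    shifted_parts, residual_parts = [], []
    # Two channel groups: the first is shifted forward in time, the second backward.
    for g, direction in enumerate((1, -1)):
        feat = x[:, g * half:(g + 1) * half]
        # Gate: one spatial map per group and frame, values in (-1, 1).
        gate = np.tanh(np.einsum("tchw,c->thw", feat, gate_w[g]))[:, None]
        gated = gate * feat          # part routed through time
        residual = feat - gated      # part that stays in its own frame
        # Shift: move gated features by one frame, zero-padding the border.
        shifted = np.zeros_like(gated)
        if direction == 1:
            shifted[1:] = gated[:-1]
        else:
            shifted[:-1] = gated[1:]
        shifted_parts.append(shifted)
        residual_parts.append(residual)
    shifted = np.concatenate(shifted_parts, axis=1)
    residual = np.concatenate(residual_parts, axis=1)
    # Fuse: per-channel weights computed from spatially pooled features
    # (an assumed rule standing in for the paper's channel weighting).
    diff = shifted.mean(axis=(2, 3)) - residual.mean(axis=(2, 3))
    alpha = 1.0 / (1.0 + np.exp(-fuse_scale * diff))
    alpha = alpha[:, :, None, None]
    return alpha * shifted + (1.0 - alpha) * residual

# Usage: 3 frames, 4 channels, 2x2 spatial grid.
x = np.ones((3, 4, 2, 2))
out = gate_shift_fuse(x, gate_w=np.zeros((2, 2)), fuse_scale=0.0)
print(out.shape)
```

With zero gating weights nothing is routed through time, so each frame keeps only its own features. With large gating weights almost everything is shifted to the neighbouring frame. A learned gate sits between these extremes and decides per location how much to route.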

updated: Wed Mar 16 2022 19:19:04 GMT+0000 (UTC)

published: Wed Mar 16 2022 19:19:04 GMT+0000 (UTC)

arXiv
