Grouped Spatial-Temporal Aggregation for Efficient Action Recognition

Chenxu Luo; Alan Yuille

Temporal reasoning is an important aspect of video analysis. 3D CNNs perform well by exploring spatial-temporal features jointly in an unconstrained way, but they also increase the computational cost substantially. Previous works try to reduce this complexity by decoupling the spatial and temporal filters. In this paper, we propose a novel decomposition method that splits the feature channels into parallel spatial and temporal groups, which lets the two groups focus on static and dynamic cues separately. We call this grouped spatial-temporal aggregation (GST). The decomposition is more parameter-efficient and enables us to quantitatively analyze the contributions of spatial and temporal features in different layers. We verify our model on several action recognition tasks that require temporal reasoning and show its effectiveness.
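
To make the parameter-efficiency claim concrete, the following is a rough sketch of the weight counts involved, not the authors' implementation. It assumes 3x3x3 kernels for the full 3D convolution and for the temporal group, a 1x3x3 kernel for the spatial group, no bias terms, and that each group maps only its own input channels to its own output channels. The function names and the `temporal_fraction` parameter are hypothetical and introduced only for illustration.

```python
def conv_params(c_in, c_out, kernel):
    """Weight count of a bias-free convolution with the given kernel shape."""
    n = c_in * c_out
    for k in kernel:
        n *= k
    return n


def full_3d_params(c_in, c_out):
    """Every output channel sees every input channel through a 3x3x3 kernel."""
    return conv_params(c_in, c_out, (3, 3, 3))


def grouped_params(c_in, c_out, temporal_fraction):
    """Channels split into a temporal group (3x3x3) and a spatial group (1x3x3).

    temporal_fraction is an assumed knob: the share of channels assigned
    to the temporal group. The two groups run in parallel and do not mix.
    """
    c_in_t = int(c_in * temporal_fraction)
    c_out_t = int(c_out * temporal_fraction)
    c_in_s = c_in - c_in_t
    c_out_s = c_out - c_out_t
    spatial = conv_params(c_in_s, c_out_s, (1, 3, 3))
    temporal = conv_params(c_in_t, c_out_t, (3, 3, 3))
    return spatial + temporal


if __name__ == "__main__":
    full = full_3d_params(64, 64)
    grouped = grouped_params(64, 64, 0.5)
    print(full, grouped, grouped / full)
```

Under these assumptions, a 64-to-64-channel layer with an even split needs one third of the weights of the full 3D convolution. The saving comes both from restricting each group to its own channels and from using the cheaper 2D kernel in the spatial group.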

updated: Sat Sep 28 2019 19:03:02 GMT+0000 (UTC)

published: Sat Sep 28 2019 19:03:02 GMT+0000 (UTC)

arXiv
