Efficient Spatialtemporal Context Modeling for Action Recognition

Congqi Cao; Yue Lu; Yifan Zhang; Dongmei Jiang; Yanning Zhang

Contextual information plays an important role in action recognition. Local operations have difficulty modeling the relation between two elements separated by a long distance. However, directly modeling the contextual information between any two points incurs a huge cost in computation and memory, especially for action recognition, where there is an additional temporal dimension. Inspired by the 2D criss-cross attention used in segmentation tasks, we propose a recurrent 3D criss-cross attention (RCCA-3D) module to model dense long-range spatiotemporal contextual information in video for action recognition. The global context is factorized into sparse relation maps. At each step, we model the relationship between points that lie on the same line along the horizontal, vertical, and depth directions, which forms a 3D criss-cross structure. We then repeat the same operation with a recurrent mechanism, so that the relation between points propagates from a line to a plane and finally to the whole spatiotemporal space. Compared with the non-local method, the proposed RCCA-3D module reduces the number of parameters and FLOPs for video context modeling by 25% and 11%, respectively. We evaluate the performance of RCCA-3D with two recent action recognition networks on three datasets and thoroughly analyze the architecture to find the best way to factorize and fuse the relation maps. Comparisons with other state-of-the-art methods demonstrate the effectiveness and efficiency of our model.
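The line-to-plane-to-space propagation described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names `criss_cross_3d` and `rcca_3d` are hypothetical, queries, keys, and values are all taken to be the input features (no learned projections), the three relation maps are fused with one joint softmax, and the output uses a plain residual sum. All of these are simplifying assumptions; the abstract only says that the paper compares several ways to factorize and fuse the maps.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def criss_cross_3d(x):
    """One criss-cross pass over features x of shape (T, H, W, C).

    Assumption: query = key = value = x (no learned projections).
    Each position attends only to positions on its own three lines
    (same h,w along T; same t,w along H; same t,h along W).
    """
    T, H, W, C = x.shape
    q = k = v = x
    scale = 1.0 / np.sqrt(C)
    a_t = np.einsum('thwc,shwc->thws', q, k) * scale  # (T, H, W, T)
    a_h = np.einsum('thwc,tgwc->thwg', q, k) * scale  # (T, H, W, H)
    a_w = np.einsum('thwc,thvc->thwv', q, k) * scale  # (T, H, W, W)
    # Joint softmax over the T + H + W candidates (an assumed fusion choice;
    # the position itself appears once on each line and is counted three times).
    attn = softmax(np.concatenate([a_t, a_h, a_w], axis=-1), axis=-1)
    w_t, w_h, w_w = np.split(attn, [T, T + H], axis=-1)
    out = (np.einsum('thws,shwc->thwc', w_t, v)
           + np.einsum('thwg,tgwc->thwc', w_h, v)
           + np.einsum('thwv,thvc->thwc', w_w, v))
    return x + out  # residual connection (assumed)


def rcca_3d(x, recurrence=3):
    """Apply the same criss-cross pass `recurrence` times."""
    for _ in range(recurrence):
        x = criss_cross_3d(x)
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = 0.1 * rng.standard_normal((3, 4, 5, 4))
    y = x.copy()
    y[0, 0, 0] += 1.0  # perturb one corner of the volume
    for r in (1, 2, 3):
        # Does the opposite corner (2, 3, 4) see the perturbation?
        changed = not np.allclose(rcca_3d(x, r)[2, 3, 4], rcca_3d(y, r)[2, 3, 4])
        print(r, changed)
```

In this sketch each position computes T + H + W affinities per pass, whereas full pairwise attention over the same volume would compute T·H·W per position. A position that shares no coordinate with the perturbed point only receives its influence on the third pass, which matches the line, plane, whole-space progression in the abstract.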

updated: Sat Mar 20 2021 14:48:12 GMT+0000 (UTC)

published: Sat Mar 20 2021 14:48:12 GMT+0000 (UTC)

arXiv

