Skeleton Sequence and RGB Frame Based Multi-Modality Feature Fusion Network for Action Recognition

Xiaoguang Zhu; Ye Zhu; Haoyu Wang; Honglin Wen; Yan Yan; Peilin Liu

Action recognition has attracted considerable attention in computer vision because of its wide application in vision systems. Previous approaches improve performance by fusing the skeleton-sequence and RGB-video modalities. However, such methods face a trade-off between accuracy and efficiency because of the high complexity of the RGB video network. To address this problem, we propose a multi-modality feature fusion network that combines the skeleton sequence with a single RGB frame instead of the full RGB video, since the key information contained in the combination of a skeleton sequence and an RGB frame is close to that contained in the combination of a skeleton sequence and an RGB video. In this way, the complementary information is retained while the complexity is reduced by a large margin. To better exploit the correspondence between the two modalities, a two-stage fusion framework is introduced in the network. In the early fusion stage, we introduce a skeleton attention module that projects the skeleton sequence onto the single RGB frame to help the RGB frame focus on the limb-movement regions. In the late fusion stage, we propose a cross-attention module that fuses the skeleton feature and the RGB feature by exploiting their correlation. Experiments on two benchmarks, NTU RGB+D and SYSU, show that the proposed model achieves competitive performance compared with state-of-the-art methods while reducing the complexity of the network.
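The abstract describes the two fusion stages only at a high level. The NumPy sketch below illustrates the general idea under our own assumptions; it is not the authors' implementation. Assumed details include: 2D joint coordinates already expressed in the pixel grid of the frame's feature map, Gaussian blobs weighted by per-joint displacement as the "projection" of the skeleton sequence, a residual `features * (1 + mask)` re-weighting for early fusion, single-head scaled dot-product cross-attention with shared random projections for late fusion, and mean pooling plus concatenation as the final fused vector. The names `skeleton_attention_mask` and `cross_attention` are hypothetical, and the features are random stand-ins for real network outputs.

```python
import numpy as np


def skeleton_attention_mask(joints, height, width, sigma=2.0):
    """Project a skeleton sequence onto one frame as a spatial mask in [0, 1].

    joints: (T, J, 2) array of (x, y) pixel coordinates over T >= 2 frames.
    Each joint leaves a Gaussian blob at every time step, weighted by how far
    that joint moves over the sequence (an assumed weighting scheme).
    """
    # Total displacement of each joint across the sequence, shape (J,)
    motion = np.linalg.norm(np.diff(joints, axis=0), axis=-1).sum(axis=0)
    weights = motion / (motion.max() + 1e-8)
    ys, xs = np.mgrid[0:height, 0:width]
    mask = np.zeros((height, width))
    for t in range(joints.shape[0]):
        for j in range(joints.shape[1]):
            x, y = joints[t, j]
            blob = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma**2))
            # Keep the strongest response per pixel so the mask stays in [0, 1]
            mask = np.maximum(mask, weights[j] * blob)
    return mask


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head scaled dot-product attention of queries over context tokens."""
    q, k, v = queries @ w_q, context @ w_k, context @ w_v
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]), axis=-1)
    return attn @ v, attn


rng = np.random.default_rng(0)
T, J, H, W, C = 8, 25, 32, 32, 16  # frames, joints, feature-map size, channels
d = C  # token width shared by both modalities (assumed)

# Early fusion: re-weight a single frame's feature map with the skeleton mask.
joints = rng.uniform(0, [W - 1, H - 1], size=(T, J, 2))
mask = skeleton_attention_mask(joints, H, W)
rgb_feat = rng.standard_normal((C, H, W))  # stand-in for RGB-branch features
attended = rgb_feat * (1.0 + mask)  # residual form keeps non-limb context

# Late fusion: skeleton and RGB tokens attend to each other.
rgb_tokens = attended.reshape(C, H * W).T  # (H*W, d)
skel_tokens = rng.standard_normal((J, d))  # stand-in for skeleton-branch features
w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
s2r, a_s2r = cross_attention(skel_tokens, rgb_tokens, w_q, w_k, w_v)
r2s, a_r2s = cross_attention(rgb_tokens, skel_tokens, w_q, w_k, w_v)
fused = np.concatenate([s2r.mean(axis=0), r2s.mean(axis=0)])  # (2 * d,)
```

Weighting each blob by joint displacement is one simple way to emphasize limb-movement regions over static joints, and the residual `1 + mask` form avoids zeroing out the rest of the frame. A trained model would learn the projection matrices and would likely use separate weights for each attention direction.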

updated: Wed Feb 23 2022 09:29:53 GMT+0000 (UTC)

published: Wed Feb 23 2022 09:29:53 GMT+0000 (UTC)

arXiv
