Correlation Net: Spatiotemporal multimodal deep learning for action recognition

Novanto Yudistira; Takio Kurita

This paper describes a network that captures multimodal correlations over arbitrary timestamps. The proposed scheme operates as a complementary extension of a multimodal convolutional neural network (CNN). Action recognition with a deep CNN requires both a spatial and a temporal stream, but reducing overfitting and fusing the two streams remain open problems. The existing fusion approach averages the two streams. Here we propose a correlation network with Shannon fusion for learning on top of a pre-trained CNN. A long-range video may contain spatiotemporal correlations over arbitrary times, which can be captured by forming the correlation network from simple fully connected layers. This approach was found to complement existing network fusion methods. The importance of multimodal correlation is validated in comparison experiments on the UCF-101 and HMDB-51 datasets, where it improved the accuracy of video recognition.
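The contrast between averaging fusion and a fully connected correlation network can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names (`average_fusion`, `correlation_head`), the layer sizes, the random untrained weights, and the final equal-weight combination are all assumptions made for illustration. Shannon fusion is not defined in the abstract and is deliberately left out.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def average_fusion(spatial_logits, temporal_logits):
    """Baseline late fusion: average the class probabilities of both streams."""
    return 0.5 * (softmax(spatial_logits) + softmax(temporal_logits))

def correlation_head(spatial_logits, temporal_logits, w1, b1, w2, b2):
    """Hypothetical correlation head (assumption, not the paper's exact design).

    Scores of both streams at all T timestamps are concatenated into one
    vector, so the fully connected layers can mix any modality with any
    timestamp.
    """
    x = np.concatenate([spatial_logits.ravel(), temporal_logits.ravel()])
    h = np.maximum(0.0, x @ w1 + b1)   # ReLU hidden layer
    return softmax(h @ w2 + b2)        # class probabilities

rng = np.random.default_rng(0)
T, C, H = 3, 101, 64                   # timestamps, classes (UCF-101), hidden units

# Stand-ins for per-timestamp outputs of pre-trained spatial and temporal CNNs.
spatial = rng.normal(size=(T, C))
temporal = rng.normal(size=(T, C))

# Averaging fusion, then averaging over timestamps.
avg = average_fusion(spatial, temporal).mean(axis=0)

# Untrained weights; in practice these would be learned on top of frozen CNNs.
w1 = rng.normal(scale=0.01, size=(2 * T * C, H))
b1 = np.zeros(H)
w2 = rng.normal(scale=0.01, size=(H, C))
b2 = np.zeros(C)
corr = correlation_head(spatial, temporal, w1, b1, w2, b2)

# Using the correlation output as a complement to averaging fusion
# (equal weighting is an assumption).
combined = 0.5 * (avg + corr)
```

In this sketch the averaging path treats each timestamp and stream independently, while the fully connected head sees all stream outputs at once, which is one plausible reading of how correlations "over arbitrary times" could be captured.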

updated: Mon Dec 16 2019 06:57:10 GMT+0000 (UTC)

published: Sun Jul 22 2018 14:48:32 GMT+0000 (UTC)

arXiv
