Context-LSTM: a robust classifier for video detection on UCF101

Dengshan Li; Rujing Wang

Video detection and human action recognition can be computationally expensive, and the models can take a long time to train. In this paper, we aim to reduce the training time and GPU memory usage of video detection while achieving competitive detection accuracy. Other works such as Two-stream, C3D, and TSN have shown excellent performance on UCF101. Here, we use a simple LSTM structure for video detection and obtain competitive top-1 accuracy on the entire UCF101 validation dataset. We name the structure Context-LSTM, since it may process deep temporal features and may simulate the human recognition system. We cascaded the LSTM blocks in PyTorch and connected the cell state flow and hidden output flow between them. At the connections between blocks, we used ReLU, Batch Normalization, and MaxPooling functions. The Context-LSTM can reduce training time and GPU memory usage while keeping state-of-the-art top-1 accuracy on the entire UCF101 validation dataset, and shows robust performance on video action detection.
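
The abstract's description of the block cascade can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: the abstract gives no layer sizes, block count, input representation, or operation order, so the class name `CascadedLSTMSketch`, the per-frame feature input, all dimensions, the ReLU then BatchNorm then MaxPool ordering, and pooling along the time axis are assumptions made only for illustration. The one detail taken from the text is that each block's final hidden and cell states are handed to the next block, with ReLU, Batch Normalization, and MaxPooling applied at the connection.

```python
import torch
import torch.nn as nn


class CascadedLSTMSketch(nn.Module):
    """Hypothetical cascade of LSTM blocks; all sizes are illustrative assumptions."""

    def __init__(self, feature_dim=512, hidden_dim=256, num_blocks=3, num_classes=101):
        super().__init__()
        self.blocks = nn.ModuleList()
        self.norms = nn.ModuleList()
        in_dim = feature_dim
        for _ in range(num_blocks):
            # Every block shares hidden_dim so (h, c) can be passed along unchanged.
            self.blocks.append(nn.LSTM(in_dim, hidden_dim, batch_first=True))
            self.norms.append(nn.BatchNorm1d(hidden_dim))
            in_dim = hidden_dim
        self.relu = nn.ReLU()
        self.pool = nn.MaxPool1d(kernel_size=2)  # assumed: halves the time axis
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        # x: (batch, time, feature_dim), e.g. per-frame features (an assumption).
        state = None  # the first block starts from zero hidden and cell states
        last = len(self.blocks) - 1
        for i, (lstm, norm) in enumerate(zip(self.blocks, self.norms)):
            # Hidden and cell states of the previous block seed this block.
            x, state = lstm(x, state)
            if i < last:
                x = x.transpose(1, 2)              # (batch, hidden_dim, time)
                x = self.pool(norm(self.relu(x)))  # connection between blocks
                x = x.transpose(1, 2)              # back to (batch, time, hidden_dim)
        # Classify from the output at the final time step.
        return self.classifier(x[:, -1])


model = CascadedLSTMSketch().eval()
with torch.no_grad():
    logits = model(torch.randn(2, 16, 512))  # 2 clips, 16 frames, 512-d features
```

With these assumed settings, `logits` has one row per clip and one column per UCF101 class. Because the hidden and cell states do not depend on sequence length, pooling along time between blocks leaves the state hand-off intact.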

updated: Sun Mar 13 2022 09:43:27 GMT+0000 (UTC)

published: Sun Mar 13 2022 09:43:27 GMT+0000 (UTC)

arXiv
