Temporal-wise Attention Spiking Neural Networks for Event Streams Classification

Man Yao; Huanhuan Gao; Guangshe Zhao; Dingheng Wang; Yihan Lin; Zhaoxu Yang; Guoqi Li

Processing spatio-temporal event streams effectively and efficiently is of great value and has various real-life applications; the events in such streams are generally sparse and non-uniform and have microsecond temporal resolution. Spiking neural networks (SNNs), as one of the brain-inspired event-triggered computing models, have the potential to extract effective spatio-temporal features from event streams. However, when aggregating individual events into frames with a new, higher temporal resolution, existing SNN models do not account for the fact that the serial frames have different signal-to-noise ratios, because event streams are sparse and non-uniform. This degrades the performance of existing SNNs. In this work, we propose a temporal-wise attention SNN (TA-SNN) model to learn frame-based representations for processing event streams. Concretely, we extend the attention concept to temporal-wise input in order to judge the significance of each frame for the final decision at the training stage and to discard irrelevant frames at the inference stage. We demonstrate that TA-SNN models improve the accuracy of event stream classification tasks. We also study the impact of multiple-scale temporal resolutions on frame-based representation. Our approach is tested on three different classification tasks: gesture recognition, image classification, and spoken digit recognition. We report state-of-the-art results on these tasks and obtain a substantial accuracy improvement (almost 19%) for gesture recognition with only 60 ms.
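The two steps the abstract describes, aggregating events into frames and weighting frames by temporal-wise attention, can be illustrated with a minimal NumPy sketch. Everything specific here is an assumption for illustration and is not taken from the abstract: the function names, the event layout (timestamp, x, y, polarity), the squeeze-then-score form of the attention with a two-layer projection, and the 0.5 discard threshold. The weights `w1` and `w2` stand in for parameters that would be learned during training.

```python
import numpy as np

def events_to_frames(t_us, x, y, p, height, width, dt_us):
    """Aggregate events into frames, one frame per dt_us microseconds.

    Hypothetical event layout: timestamp in microseconds, pixel (x, y),
    and polarity p in {0, 1}. Returns an array of shape (T, 2, H, W).
    """
    t_us = np.asarray(t_us, dtype=np.int64)
    x = np.asarray(x, dtype=np.int64)
    y = np.asarray(y, dtype=np.int64)
    p = np.asarray(p, dtype=np.int64)
    frame_idx = (t_us - t_us.min()) // dt_us
    n_frames = int(frame_idx.max()) + 1
    frames = np.zeros((n_frames, 2, height, width), dtype=np.float32)
    # Unbuffered add, so repeated events at the same pixel accumulate.
    np.add.at(frames, (frame_idx, p, y, x), 1.0)
    return frames

def temporal_attention_scores(frames, w1, w2):
    """Score each frame in (0, 1): squeeze to one scalar per frame,
    then apply a small two-layer projection (assumed form)."""
    squeezed = frames.mean(axis=(1, 2, 3))       # shape (T,)
    hidden = np.maximum(w1 @ squeezed, 0.0)      # ReLU, shape (T // r,)
    return 1.0 / (1.0 + np.exp(-(w2 @ hidden)))  # sigmoid, shape (T,)

def apply_attention(frames, scores, training, threshold=0.5):
    """Training: rescale every frame by its score.
    Inference: drop frames whose score is below the (assumed) threshold."""
    if training:
        return frames * scores[:, None, None, None]
    return frames[scores >= threshold]

# Four toy events on a 2x2 sensor, aggregated with dt = 20 microseconds.
frames = events_to_frames([0, 10, 25, 40], [0, 1, 1, 0], [0, 0, 1, 1],
                          [1, 0, 1, 1], height=2, width=2, dt_us=20)
rng = np.random.default_rng(0)
w1 = rng.normal(size=(2, frames.shape[0]))   # placeholder for learned weights
w2 = rng.normal(size=(frames.shape[0], 2))   # placeholder for learned weights
scores = temporal_attention_scores(frames, w1, w2)
kept = apply_attention(frames, scores, training=False)
```

In this sketch a larger `dt_us` yields fewer, denser frames, which is one way to experiment with the multiple-scale temporal resolutions the abstract mentions.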

updated: Sun Jul 25 2021 02:28:44 GMT+0000 (UTC)

published: Sun Jul 25 2021 02:28:44 GMT+0000 (UTC)

arXiv
