Event Transformer

Zhihao Li; M. Salman Asif; Zhan Ma

The event camera is a bio-inspired vision sensor with high dynamic range, fast response, and low power consumption, and it has recently attracted extensive attention across a wide range of vision tasks. Unlike conventional cameras, which output intensity frames at a fixed time interval, an event camera records pixel brightness changes (a.k.a. events) asynchronously (in time) and sparsely (in space). Existing methods often aggregate the events that occurred within a predefined temporal duration for downstream tasks, which appears to overlook the varying behavior of fine-grained temporal events. This work proposes the Event Transformer to directly process the event sequence in its native vectorized tensor format. It cascades a Local Transformer (LXformer) for exploiting local temporal correlation, a Sparse Conformer (SCformer) for embedding local spatial similarity, and a Global Transformer (GXformer) for further aggregating global information. Applied in series, these modules characterize the temporal and spatial correlations in the raw input events and generate spatiotemporal features for downstream tasks. Experimental studies have been conducted extensively, comparing against fourteen existing algorithms on five different datasets widely used for classification. Quantitative results show that the Event Transformer achieves state-of-the-art classification accuracy with the lowest computational resource requirements, making it practically attractive for event-based vision tasks.
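To illustrate why aggregating events over a fixed time window can hide fine-grained temporal behavior, here is a minimal sketch in plain Python. It is not the paper's method: the `Event` layout (pixel coordinates `x`, `y`, timestamp `t`, polarity `p`) and the helper `aggregate_to_frame` are assumptions made for this example, since the abstract does not specify the tensor layout.

```python
from collections import namedtuple

# Hypothetical event layout (an assumption, not taken from the paper):
# pixel coordinates, timestamp in microseconds, polarity (+1 or -1).
Event = namedtuple("Event", ["x", "y", "t", "p"])


def aggregate_to_frame(events, width, height, t_start, t_end):
    """Sum polarities per pixel over [t_start, t_end).

    Timestamps and event order are discarded by this step.
    """
    frame = [[0 for _ in range(width)] for _ in range(height)]
    for e in events:
        if t_start <= e.t < t_end:
            frame[e.y][e.x] += e.p
    return frame


# Two toy event streams that differ in timing and polarity order at pixel (1, 0).
stream_a = [Event(1, 0, 100, +1), Event(1, 0, 200, -1), Event(0, 1, 300, +1)]
stream_b = [Event(1, 0, 900, -1), Event(1, 0, 950, +1), Event(0, 1, 300, +1)]

frame_a = aggregate_to_frame(stream_a, width=2, height=2, t_start=0, t_end=1000)
frame_b = aggregate_to_frame(stream_b, width=2, height=2, t_start=0, t_end=1000)

# The raw sequences differ, yet the aggregated frames are identical.
print(stream_a != stream_b)  # True
print(frame_a == frame_b)    # True
```

In this toy case the two streams are distinguishable only while they remain a sequence of `(x, y, t, p)` records, which is the kind of information a model that consumes the event sequence directly can still use.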

updated: Mon Apr 11 2022 15:05:06 GMT+0000 (UTC)

published: Mon Apr 11 2022 15:05:06 GMT+0000 (UTC)

arXiv
