Event Voxel Set Transformer for Spatiotemporal Representation Learning on Event Streams

Bochen Xie; Yongjian Deng; Zhanpeng Shao; Hai Liu; Qingsong Xu; Youfu Li

Event cameras are neuromorphic vision sensors that represent visual information as sparse, asynchronous event streams. Most state-of-the-art event-based methods project events into dense frames and process them with conventional learning models. However, these approaches sacrifice the sparsity and high temporal resolution of event data, resulting in large model sizes and high computational complexity. To fit the sparse nature of events and fully exploit the relationships among them, we develop a novel attention-aware model named Event Voxel Set Transformer (EVSTr) for spatiotemporal representation learning on event streams. It first converts the event stream into voxel sets and then hierarchically aggregates voxel features to obtain robust representations. The core of EVSTr is an event voxel transformer encoder that extracts discriminative spatiotemporal features. The encoder consists of two components: a Multi-Scale Neighbor Embedding Layer (MNEL) for local information aggregation and a Voxel Self-Attention Layer (VSAL) for global feature interactions. To enable the network to incorporate long-range temporal structure, we introduce a segment modeling strategy that learns motion patterns from a sequence of segmented voxel sets. We evaluate the proposed model on two event-based recognition tasks: object classification and action recognition. Comprehensive experiments show that EVSTr achieves state-of-the-art performance while maintaining low model complexity. Additionally, we present a new dataset (NeuroHAR), recorded in challenging visual scenarios, to address the lack of real-world event-based datasets for action recognition.
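The abstract states that the event stream is first converted into voxel sets but does not give the procedure. The sketch below shows one plausible way to do this with NumPy, purely as an illustration. The function name `events_to_voxel_set`, the `[x, y, t, polarity]` event layout, the parameters `voxel_xy`, `n_time_bins`, and `max_voxels`, and the choice to keep the most populated voxels are all assumptions made here, not details taken from the paper.

```python
import numpy as np


def events_to_voxel_set(events, voxel_xy=4, n_time_bins=8, max_voxels=512):
    """Group events into a sparse set of non-empty voxels (illustrative only).

    events: (N, 4) array with columns [x, y, t, polarity] (assumed layout).
    Returns (coords, counts): integer voxel indices of shape (M, 3) and the
    number of events that fell into each voxel, with M <= max_voxels.
    """
    if len(events) == 0:
        return np.zeros((0, 3), dtype=np.int64), np.zeros((0,), dtype=np.int64)

    x = events[:, 0].astype(np.int64)
    y = events[:, 1].astype(np.int64)
    t = events[:, 2].astype(np.float64)

    # Normalize timestamps to [0, 1] and assign each event to a time bin.
    span = t.max() - t.min()
    t_norm = (t - t.min()) / span if span > 0 else np.zeros_like(t)
    t_bin = np.minimum((t_norm * n_time_bins).astype(np.int64), n_time_bins - 1)

    # Integer voxel index of every event: (x bin, y bin, time bin).
    idx = np.stack([x // voxel_xy, y // voxel_xy, t_bin], axis=1)

    # Only voxels that contain at least one event appear here, so the
    # result stays sparse instead of becoming a dense frame.
    coords, counts = np.unique(idx, axis=0, return_counts=True)

    # Assumption: keep the voxels with the most events, up to max_voxels.
    order = np.argsort(-counts, kind="stable")[:max_voxels]
    return coords[order], counts[order]
```

Only non-empty voxels are returned, so the output size follows the scene activity rather than the sensor resolution, which is the property the abstract contrasts with dense-frame methods. The per-voxel features a real model would consume (for example, polarity-aware statistics) are left out of this sketch.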

updated: Thu May 18 2023 07:48:25 GMT+0000 (UTC)

published: Tue Mar 07 2023 12:48:02 GMT+0000 (UTC)

arXiv
