EvConv: Fast CNN Inference on Event Camera Inputs For High-Speed Robot Perception

Sankeerth Durvasula; Yushi Guan; Nandita Vijaykumar

Event cameras capture visual information with high temporal resolution and a wide dynamic range, which enables sensing at fine time granularities (e.g., microseconds) in rapidly changing environments. This makes event cameras highly useful for high-speed robotics tasks involving rapid motion, such as high-speed perception, object tracking, and control. However, convolutional neural network (CNN) inference on event camera streams cannot currently run in real time at the rates at which event cameras operate: current CNN inference times are typically closer in order of magnitude to the frame rates of regular frame-based cameras. Real-time inference at event camera rates is necessary to fully leverage the high frequency and high temporal resolution that event cameras offer. This paper presents EvConv, a new approach to enable fast CNN inference on inputs from event cameras. We observe that consecutive inputs to the CNN from an event camera differ only slightly. Thus, we propose to perform inference on the difference between consecutive input tensors, which we call the increment. Because increments are very sparse, this significantly reduces the number of floating-point operations required, and thus the inference latency. We design EvConv to leverage the irregular sparsity in increments from event cameras and to retain the sparsity of these increments across all layers of the network. We demonstrate a reduction of up to 98% in the number of floating-point operations required in the forward pass. We also demonstrate a speedup of up to 1.6x for inference using CNNs for tasks such as depth estimation, object recognition, and optical flow estimation, with almost no loss in accuracy.
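The idea of running inference on increments rests on the linearity of convolution: conv(x_t) = conv(x_{t-1}) + conv(x_t - x_{t-1}). The following is a minimal pure-Python sketch of that identity for a single 2D convolution, not the EvConv implementation. The function names (`conv2d_dense`, `conv2d_increment`), the 32x32 input, the 3x3 kernel, and the choice of 10 changed pixels are all illustrative assumptions. Nonlinear layers and the way EvConv retains sparsity across layers are not modeled here.

```python
import random

def conv2d_dense(x, k):
    """Dense 'valid' cross-correlation; returns (output, multiply-add count)."""
    h, w = len(x), len(x[0])
    kh, kw = len(k), len(k[0])
    out = [[0.0] * (w - kw + 1) for _ in range(h - kh + 1)]
    macs = 0
    for i in range(h - kh + 1):
        for j in range(w - kw + 1):
            for a in range(kh):
                for b in range(kw):
                    out[i][j] += x[i + a][j + b] * k[a][b]
                    macs += 1
    return out, macs

def conv2d_increment(prev_out, delta, k):
    """Update a previous output using only the nonzero entries of delta."""
    kh, kw = len(k), len(k[0])
    out = [row[:] for row in prev_out]  # copy so prev_out is left untouched
    oh, ow = len(out), len(out[0])
    macs = 0
    for r, row in enumerate(delta):
        for c, d in enumerate(row):
            if d == 0:
                continue  # unchanged pixels cost nothing
            # Input pixel (r, c) contributes to outputs (r - a, c - b).
            for a in range(kh):
                for b in range(kw):
                    i, j = r - a, c - b
                    if 0 <= i < oh and 0 <= j < ow:
                        out[i][j] += d * k[a][b]
                        macs += 1
    return out, macs

# Hypothetical data: integer-valued inputs keep the comparison exact.
random.seed(0)
size = 32
prev = [[float(random.randint(0, 9)) for _ in range(size)] for _ in range(size)]
kernel = [[float(random.randint(-2, 2)) for _ in range(3)] for _ in range(3)]

# The next input differs from the previous one at only 10 pixels.
curr = [row[:] for row in prev]
for pos in random.sample(range(size * size), 10):
    curr[pos // size][pos % size] += random.choice([-1.0, 1.0])

delta = [[c - p for c, p in zip(cr, pr)] for cr, pr in zip(curr, prev)]

prev_out, _ = conv2d_dense(prev, kernel)
dense_out, dense_macs = conv2d_dense(curr, kernel)
inc_out, inc_macs = conv2d_increment(prev_out, delta, kernel)

print("outputs match:", inc_out == dense_out)
print("dense multiply-adds:", dense_macs)
print("incremental multiply-adds:", inc_macs)
```

In this toy setting the incremental update touches at most nine output positions per changed pixel, so its cost scales with the number of changed pixels rather than with the input size. Carrying that sparsity through nonlinearities and deeper layers is the harder part that the paper addresses.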

updated: Wed Mar 08 2023 15:47:13 GMT+0000 (UTC)

published: Wed Mar 08 2023 15:47:13 GMT+0000 (UTC)

arXiv
