HALSIE: Hybrid Approach to Learning Segmentation by Simultaneously Exploiting Image and Event Modalities

Shristi Das Biswas; Adarsh Kosta; Chamika Liyanagedera; Marco Apolinario; Kaushik Roy

We present HALSIE, a novel hybrid approach to semantic segmentation that simultaneously leverages image and event modalities. Event cameras are vision sensors that detect changes in per-pixel intensity to generate asynchronous 'event streams'. They offer significant advantages over standard frame-based cameras due to their higher dynamic range, higher temporal resolution, and lack of motion blur. However, events measure only the varying component of the visual signal, which limits their ability to encode scene context. To supply the missing contextual information, we postulate that fusing spatially dense frames with temporally dense events can generate semantic maps with fine-grained predictions. Prior work in event-based vision has achieved outstanding performance but at substantial inference cost, typically beyond 50 mJ per cycle. By redesigning the end-to-end learning framework, we reduce inference cost by up to ∼20× while retaining similar performance. To achieve this, our method efficiently extracts and fuses complementary features, exploiting the best of both modalities. In particular, HALSIE comprises dual encoders: a Spiking Neural Network (SNN) branch that provides rich temporal cues from asynchronous events, and a standard Artificial Neural Network (ANN) branch that extracts spatial information from regular frame data to enable cross-domain learning. Our hybrid network reaches state-of-the-art performance on the real-world DDD-17, MVSEC, and DSEC-Semantic datasets with up to ∼33× higher parameter efficiency and favorable inference cost (17.9 mJ per cycle), making it suitable for resource-constrained edge applications. Further, the effectiveness of the design choices in our approach is evidenced by a thorough ablation study.
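As a rough illustration of the sensing principle described above (events fire where per-pixel intensity changes), the following sketch compares two small grayscale frames and emits an event wherever the log-intensity change reaches a threshold. It is an idealized model under stated assumptions, not the HALSIE pipeline: a real event camera fires asynchronously per pixel rather than by comparing frames, and the function name `events_between`, the threshold value, and the `+1.0` offset are all hypothetical choices made for this example.

```python
import math

def events_between(prev_frame, next_frame, threshold=0.2, t=0.0):
    """Return (x, y, t, polarity) tuples for pixels whose log-intensity
    change reaches `threshold`.

    Idealized assumption: at most one event per pixel, all stamped with
    the same time `t`. Real sensors emit events asynchronously.
    """
    events = []
    for y, (prev_row, next_row) in enumerate(zip(prev_frame, next_frame)):
        for x, (p, n) in enumerate(zip(prev_row, next_row)):
            # The +1.0 offset avoids log(0) for black pixels (assumption).
            delta = math.log(n + 1.0) - math.log(p + 1.0)
            if abs(delta) >= threshold:
                polarity = 1 if delta > 0 else -1
                events.append((x, y, t, polarity))
    return events

# Two tiny 2x3 frames: one pixel brightens, one darkens, the rest are static.
prev_frame = [[10, 10, 10],
              [10, 10, 10]]
next_frame = [[10, 40, 10],
              [2, 10, 10]]

events = events_between(prev_frame, next_frame, threshold=0.2, t=0.01)
print(events)  # [(1, 0, 0.01, 1), (0, 1, 0.01, -1)]
```

Static pixels produce no events at all, which is why an event stream alone carries little scene context and why the abstract argues for fusing it with spatially dense frames.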

updated: Fri Mar 17 2023 19:18:17 GMT+0000 (UTC)

published: Sat Nov 19 2022 17:09:50 GMT+0000 (UTC)

arXiv
