Combining Events and Frames using Recurrent Asynchronous Multimodal Networks for Monocular Depth Prediction

Daniel Gehrig; Michelle Rüegg; Mathias Gehrig; Javier Hidalgo Carrio; Davide Scaramuzza

Event cameras are novel vision sensors that report per-pixel brightness changes as a stream of asynchronous "events". They offer significant advantages compared to standard cameras due to their high temporal resolution, high dynamic range and lack of motion blur. However, events only measure the varying component of the visual signal, which limits their ability to encode scene context. By contrast, standard cameras measure absolute intensity frames, which capture a much richer representation of the scene. Both sensors are thus complementary. However, due to the asynchronous nature of events, combining them with synchronous images remains challenging, especially for learning-based methods. This is because traditional recurrent neural networks (RNNs) are not designed for asynchronous and irregular data from additional sensors. To address this challenge, we introduce Recurrent Asynchronous Multimodal (RAM) networks, which generalize traditional RNNs to handle asynchronous and irregular data from multiple sensors. Inspired by traditional RNNs, RAM networks maintain a hidden state that is updated asynchronously and can be queried at any time to generate a prediction. We apply this novel architecture to monocular depth estimation with events and frames where we show an improvement over state-of-the-art methods by up to 30% in terms of mean absolute depth error. To enable further research on multimodal learning with events, we release EventScape, a new dataset with events, intensity frames, semantic labels, and depth maps recorded in the CARLA simulator.
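The idea of a hidden state that is updated asynchronously by several sensors and can be queried at any time can be illustrated with a toy sketch. This is not the RAM architecture from the paper: the class `ToyRecurrentFusion`, its scalar hidden state, the per-sensor `gains`, and the sample measurements are all hypothetical stand-ins chosen for illustration. A real network would use learned encoders and a learned recurrent update over feature maps.

```python
import heapq
import math


class ToyRecurrentFusion:
    """Toy scalar 'hidden state' shared by several sensors (illustrative only)."""

    def __init__(self, gains):
        # Hypothetical per-sensor blend weights in (0, 1]; not learned here.
        self.gains = gains
        self.hidden = 0.0
        self.last_time = None

    def update(self, sensor, timestamp, value):
        # Blend the squashed measurement into the shared state.
        # Each sensor updates the state whenever its data arrives.
        gate = self.gains[sensor]
        self.hidden = (1.0 - gate) * self.hidden + gate * math.tanh(value)
        self.last_time = timestamp

    def query(self):
        # Read out a prediction from the current state at any time.
        return self.hidden


# Made-up measurements as (timestamp in seconds, sensor name, value).
# Events arrive irregularly; frames arrive at a fixed, lower rate.
events = [(0.001, "events", 0.5), (0.004, "events", -0.2), (0.012, "events", 0.9)]
frames = [(0.000, "frame", 1.0), (0.010, "frame", 0.8)]

model = ToyRecurrentFusion({"events": 0.2, "frame": 0.7})
predictions = []

# heapq.merge interleaves the two already-sorted streams by timestamp.
for timestamp, sensor, value in heapq.merge(events, frames):
    model.update(sensor, timestamp, value)
    predictions.append((timestamp, model.query()))
```

The point of the sketch is only the control flow: neither sensor waits for the other, each measurement updates the shared state as it arrives, and a prediction is available after every update rather than only at frame times.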

updated: Thu Feb 18 2021 13:24:35 GMT+0000 (UTC)

published: Thu Feb 18 2021 13:24:35 GMT+0000 (UTC)

arXiv
