Multi-Object Tracking with Deep Learning Ensemble for Unmanned Aerial System Applications

Wanlin Xie; Jaime Ide; Daniel Izadi; Sean Banger; Thayne Walker; Ryan Ceresani; Dylan Spagnuolo; Christopher Guagliano; Henry Diaz; Jason Twedt

Multi-object tracking (MOT) is a crucial component of situational awareness in military defense applications. With the growing use of unmanned aerial systems (UASs), MOT methods for aerial surveillance are in high demand. Applying MOT on UAS platforms presents specific challenges, such as a moving sensor, changing zoom levels, dynamic backgrounds, illumination changes, obscurations, and small objects. In this work, we present a robust object tracking architecture designed to accommodate noise in real-time situations. We propose a kinematic prediction model, called Deep Extended Kalman Filter (DeepEKF), in which a sequence-to-sequence architecture is used to predict entity trajectories in latent space. DeepEKF uses a learned image embedding along with an attention mechanism trained to weight the importance of areas in an image to predict future states. For visual scoring, we experiment with different similarity measures to calculate distance based on entity appearance, including a convolutional neural network (CNN) encoder pre-trained using Siamese networks. In initial evaluation experiments, we show that our method, which combines the scoring structures of the kinematic and visual models within a multiple hypothesis tracking (MHT) framework, improves performance, especially in edge cases where entity motion is unpredictable or the data contain frames with significant gaps.
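To make the idea of combining kinematic and visual scores concrete, the following is a minimal sketch and not the authors' scoring method. It assumes a predicted 2D position (standing in for the output of a kinematic model such as DeepEKF), appearance embeddings (standing in for CNN encoder outputs), cosine distance as one possible similarity measure, and a simple weighted sum. The function names, `sigma`, and `visual_weight` are all hypothetical.

```python
import math


def cosine_distance(a, b):
    """Appearance distance between two embeddings: 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # assumption: a zero vector carries no appearance evidence
    return 1.0 - dot / (norm_a * norm_b)


def kinematic_distance(predicted_xy, observed_xy, sigma):
    """Squared position error scaled by an assumed isotropic std dev `sigma`."""
    dx = observed_xy[0] - predicted_xy[0]
    dy = observed_xy[1] - predicted_xy[1]
    return (dx * dx + dy * dy) / (sigma * sigma)


def combined_cost(predicted_xy, observed_xy, track_emb, det_emb,
                  sigma=5.0, visual_weight=0.5):
    """Weighted sum of kinematic and visual distances (hypothetical weighting).

    Lower cost means the detection is a better match for the track.
    """
    kin = kinematic_distance(predicted_xy, observed_xy, sigma)
    vis = cosine_distance(track_emb, det_emb)
    return (1.0 - visual_weight) * kin + visual_weight * vis


if __name__ == "__main__":
    # One track prediction compared against two candidate detections.
    near_similar = combined_cost((10.0, 10.0), (11.0, 10.0), [1.0, 0.0], [0.9, 0.1])
    far_different = combined_cost((10.0, 10.0), (30.0, 25.0), [1.0, 0.0], [0.0, 1.0])
    print(near_similar < far_different)
```

In an MHT-style tracker, per-pair costs of this kind would feed the ranking of competing association hypotheses across frames; the sketch only covers a single track-detection pair.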

updated: Tue Oct 05 2021 13:50:38 GMT+0000 (UTC)

published: Tue Oct 05 2021 13:50:38 GMT+0000 (UTC)

arXiv
