Minkowski Tracker: A Sparse Spatio-Temporal R-CNN for Joint Object Detection and Tracking

JunYoung Gwak; Silvio Savarese; Jeannette Bohg

Recent research in multi-task learning has shown the benefit of solving related problems in a single neural network. 3D object detection and multi-object tracking (MOT) are two heavily intertwined problems: both predict and associate the location of an object instance across time. However, most previous work in 3D MOT treats the detector as a separate preceding stage and feeds the detector output to the tracker as a disjoint input. In this work, we present Minkowski Tracker, a sparse spatio-temporal R-CNN that jointly solves object detection and tracking. Inspired by region-based CNNs (R-CNN), we propose to solve tracking as a second stage of the object detector R-CNN that predicts assignment probabilities to tracks. First, Minkowski Tracker takes 4D point clouds as input and generates a spatio-temporal bird's-eye-view (BEV) feature map through a 4D sparse convolutional encoder network. Then, our proposed TrackAlign aggregates track region-of-interest (ROI) features from the BEV features. Finally, Minkowski Tracker updates each track and its confidence score based on the detection-to-track match probability predicted from the ROI features. We show in large-scale experiments that the overall performance gain of our method is due to four factors: (1) the temporal reasoning of the 4D encoder improves detection performance; (2) the multi-task learning of object detection and MOT jointly enhances both tasks; (3) the detection-to-track match score learns an implicit motion model that enhances track assignment; (4) the detection-to-track match score improves the quality of the track confidence score. As a result, Minkowski Tracker achieved state-of-the-art performance on the nuScenes tracking task without hand-designed motion models.
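
The final step described in the abstract, updating tracks from a detection-to-track match probability, can be illustrated with a minimal sketch. The abstract does not say how probabilities are turned into assignments or how the confidence score is updated, so everything here is an assumption for illustration: the function names `greedy_assign` and `update_confidence`, the greedy matching rule, the `threshold` and `momentum` parameters, and the example matrix are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple


def greedy_assign(
    match_prob: List[List[float]], threshold: float = 0.5
) -> Tuple[Dict[int, int], List[int]]:
    """Greedily pair detections with tracks by descending match probability.

    match_prob[d][t] is a (hypothetical) predicted probability that
    detection d belongs to track t. Returns the detection -> track mapping
    and the indices of detections left unmatched.
    """
    # Keep only pairs whose probability clears the assumed threshold.
    candidates = [
        (p, d, t)
        for d, row in enumerate(match_prob)
        for t, p in enumerate(row)
        if p >= threshold
    ]
    # Highest probability first; ties broken by lower indices.
    candidates.sort(key=lambda c: (-c[0], c[1], c[2]))

    assignment: Dict[int, int] = {}
    used_tracks = set()
    for _, d, t in candidates:
        if d in assignment or t in used_tracks:
            continue  # each detection and each track is used at most once
        assignment[d] = t
        used_tracks.add(t)

    unmatched = [d for d in range(len(match_prob)) if d not in assignment]
    return assignment, unmatched


def update_confidence(track_conf: float, match_p: float, momentum: float = 0.5) -> float:
    """Blend the previous track confidence with the new match probability.

    This moving-average rule is an assumption, not the paper's update.
    """
    return momentum * track_conf + (1.0 - momentum) * match_p


# Three detections scored against two existing tracks (made-up numbers).
example_prob = [
    [0.9, 0.2],
    [0.8, 0.7],
    [0.1, 0.3],
]
assignment, unmatched = greedy_assign(example_prob)
# assignment == {0: 0, 1: 1}; detection 2 is unmatched and could start a new track.
```

In this sketch, detection 1 prefers track 0 but loses it to detection 0, which has the higher probability, and falls back to track 1. A learned match score of this kind is what lets such a rule work without a hand-designed motion model supplying the association cost.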

updated: Fri Aug 26 2022 17:39:05 GMT+0000 (UTC)

published: Mon Aug 22 2022 04:47:40 GMT+0000 (UTC)

arXiv
