DiffusionTrack: Diffusion Model For Multi-Object Tracking

Run Luo; Zikai Song; Lintao Ma; Jinlin Wei; Wei Yang; Min Yang

Multi-object tracking (MOT) is a challenging vision task that aims to detect individual objects within a single frame and associate them across multiple frames. Recent MOT approaches can be categorized into two-stage tracking-by-detection (TBD) methods and one-stage joint detection and tracking (JDT) methods. Despite their success, these approaches share common problems, such as harmful global or local inconsistency, a poor trade-off between robustness and model complexity, and a lack of flexibility across different scenes within the same video. In this paper, we propose a simple but robust framework that formulates object detection and association jointly as a consistent denoising diffusion process from paired noise boxes to paired ground-truth boxes. This novel progressive denoising diffusion strategy substantially augments the tracker's effectiveness, enabling it to discriminate between various objects. During the training stage, paired object boxes diffuse from paired ground-truth boxes to a random distribution, and the model learns detection and tracking simultaneously by reversing this noising process. At inference, the model refines a set of paired, randomly generated boxes into the detection and tracking results through a flexible one-step or multi-step denoising diffusion process. Extensive experiments on three widely used MOT benchmarks, MOT17, MOT20, and DanceTrack, demonstrate that our approach achieves competitive performance compared to current state-of-the-art methods.
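The training-stage noising described in the abstract can be illustrated with a minimal sketch of a standard Gaussian forward diffusion applied to paired boxes. The abstract does not specify the noise schedule, the box parameterization, or any function names, so everything here is an assumption made for illustration: a cosine schedule, boxes stored as normalized (cx, cy, w, h), a pair axis holding the same object in two adjacent frames, and the hypothetical helpers `cosine_alpha_bar` and `diffuse_paired_boxes`. It is not the authors' implementation.

```python
import numpy as np


def cosine_alpha_bar(num_steps, s=0.008):
    """Cumulative signal-retention values alpha_bar[t] for t = 0..num_steps-1.

    Assumption: a cosine schedule; the abstract does not name a schedule.
    """
    t = np.linspace(0, num_steps, num_steps + 1) / num_steps
    f = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
    return (f / f[0])[1:]


def diffuse_paired_boxes(paired_boxes, t, alpha_bar, rng):
    """Sample noisy paired boxes at step t from clean paired boxes.

    paired_boxes: array of shape (N, 2, 4); axis 1 indexes the two frames
    of a pair, axis 2 is an assumed normalized (cx, cy, w, h) layout.
    Returns the noisy boxes and the Gaussian noise that was added.
    """
    noise = rng.standard_normal(paired_boxes.shape)
    a = alpha_bar[t]
    # Closed-form forward step: x_t = sqrt(a) * x_0 + sqrt(1 - a) * eps
    noisy = np.sqrt(a) * paired_boxes + np.sqrt(1.0 - a) * noise
    return noisy, noise


rng = np.random.default_rng(0)
alpha_bar = cosine_alpha_bar(1000)

# Three hypothetical objects, each with a box in frame t-1 and frame t.
gt_pairs = rng.uniform(0.1, 0.9, size=(3, 2, 4))

slightly_noisy, _ = diffuse_paired_boxes(gt_pairs, 10, alpha_bar, rng)
nearly_random, _ = diffuse_paired_boxes(gt_pairs, 999, alpha_bar, rng)
```

Under these assumptions, small step indices leave the pairs close to the ground truth, while the final step yields essentially pure Gaussian noise. A denoising model trained to invert this corruption would then start from random paired boxes at inference and refine them in one or several steps, as the abstract describes.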

updated: Sat Aug 19 2023 04:48:41 GMT+0000 (UTC)

published: Sat Aug 19 2023 04:48:41 GMT+0000 (UTC)

arXiv
