RelationTrack: Relation-aware Multiple Object Tracking with Decoupled Representation

En Yu; Zhuoling Li; Shoudong Han; Hongwei Wang

Existing online multiple object tracking (MOT) algorithms often consist of two subtasks: detection and re-identification (ReID). To improve inference speed and reduce complexity, current methods commonly integrate these two subtasks into a unified framework. However, detection and ReID require different features, which leads to an optimization conflict during training. To alleviate this conflict, we devise a module named Global Context Disentangling (GCD) that decouples the learned representation into detection-specific and ReID-specific embeddings. This module thereby provides an implicit way to balance the different requirements of the two subtasks. Moreover, we observe that preceding MOT methods typically use local information to associate detected targets and neglect the global semantic relation. To address this limitation, we develop a module, referred to as Guided Transformer Encoder (GTE), that combines the powerful reasoning ability of the Transformer encoder with deformable attention. Unlike previous works, GTE avoids analyzing all pixels and attends only to the relation between query nodes and a few self-adaptively selected key samples, which makes it computationally efficient. Extensive experiments on the MOT16, MOT17 and MOT20 benchmarks demonstrate the superiority of the proposed MOT framework, RelationTrack. The results indicate that RelationTrack significantly surpasses preceding methods and establishes new state-of-the-art performance, e.g., an IDF1 of 70.5% and a MOTA of 67.2% on MOT20.
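
As a rough illustration of attending to only a few adaptively selected key samples, the sketch below shows a single-head, single-query version of the general deformable-attention idea in NumPy. It is not the authors' GTE implementation: the function name `sparse_sampled_attention`, the projection matrices `w_off` and `w_attn`, the random inputs, and the use of nearest-neighbour sampling (bilinear interpolation is the more usual choice) are all assumptions made for illustration.

```python
import numpy as np

def sparse_sampled_attention(feat, query_xy, w_off, w_attn, num_samples=4):
    """Attend from one query location to a few predicted sample locations.

    feat:     (H, W, C) feature map
    query_xy: (x, y) integer location of the query node
    w_off:    (C, 2 * num_samples) hypothetical projection predicting offsets
    w_attn:   (C, num_samples) hypothetical projection predicting sample scores
    """
    H, W, _ = feat.shape
    x, y = query_xy
    q = feat[y, x]                                  # query feature, shape (C,)

    # Offsets and scores are both predicted from the query feature itself,
    # so the sampled keys adapt to the content at the query location.
    offsets = (q @ w_off).reshape(num_samples, 2)   # (dx, dy) per sample
    logits = q @ w_attn                             # one score per sample

    # Softmax over the K samples only, not over all H * W pixels.
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()

    # Nearest-neighbour sampling, clipped to the map borders (a simplification).
    sx = np.clip(np.rint(x + offsets[:, 0]).astype(int), 0, W - 1)
    sy = np.clip(np.rint(y + offsets[:, 1]).astype(int), 0, H - 1)
    sampled = feat[sy, sx]                          # (K, C)

    return weights @ sampled, weights, np.stack([sx, sy], axis=1)

# Toy usage with random data (all values are placeholders).
rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 8, 16))
w_off = rng.standard_normal((16, 8))
w_attn = rng.standard_normal((16, 4))
out, weights, coords = sparse_sampled_attention(feat, (3, 5), w_off, w_attn)
```

In this sketch the cost per query grows with the number of samples K rather than with the number of pixels H × W, which is the source of the efficiency gain the abstract refers to.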

updated: Mon May 10 2021 13:00:40 GMT+0000 (UTC)

published: Mon May 10 2021 13:00:40 GMT+0000 (UTC)

arXiv
