MixFormerV2: Efficient Fully Transformer Tracking

Yutao Cui; Tianhui Song; Gangshan Wu; Limin Wang

Transformer-based trackers have achieved strong accuracy on the standard benchmarks. However, their efficiency remains an obstacle to practical deployment on both GPU and CPU platforms. In this paper, to overcome this issue, we propose a fully transformer tracking framework, coined MixFormerV2, without any dense convolutional operations or complex score prediction modules. Our key design is to introduce four special prediction tokens and concatenate them with the tokens from the target template and search area. We then apply a unified transformer backbone to this mixed token sequence. These prediction tokens are able to capture the complex correlation between the target template and the search area via mixed attention. Based on them, we can easily predict the tracking box and estimate its confidence score through simple MLP heads. To further improve the efficiency of MixFormerV2, we present a new distillation-based model reduction paradigm, including dense-to-sparse distillation and deep-to-shallow distillation. The former aims to transfer knowledge from the dense-head-based MixViT to our fully transformer tracker, while the latter is used to prune some layers of the backbone. We instantiate two types of MixFormerV2: MixFormerV2-B achieves an AUC of 70.6% on LaSOT and an AUC of 57.4% on TNL2k with a high GPU speed of 165 FPS, and MixFormerV2-S surpasses FEAR-L by 2.7% AUC on LaSOT at real-time CPU speed.
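The prediction-token design described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the token order, the identity-projection attention, the single linear layer with a sigmoid standing in for each "simple MLP head", the mean pooling used for the confidence score, and all names (`track_step`, `linear_sigmoid_head`, `random_tokens`) are assumptions made for illustration only. Weights are random, so the outputs carry no meaning beyond showing the data flow.

```python
import math
import random


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    # Single-head self-attention with identity Q/K/V projections (assumption:
    # a real backbone would use learned projections and multiple heads).
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, tokens)) for d in range(dim)])
    return out


def linear_sigmoid_head(token, weights):
    # Hypothetical stand-in for a "simple MLP head": one linear layer + sigmoid.
    z = sum(w * x for w, x in zip(weights, token))
    return 1.0 / (1.0 + math.exp(-z))


def track_step(template, search, pred_tokens, box_weights, score_weights, num_layers=2):
    n_pred = len(pred_tokens)
    # Concatenate prediction, template, and search tokens into one sequence
    # (the ordering here is an assumption).
    mixed = pred_tokens + template + search
    for _ in range(num_layers):
        attended = self_attention(mixed)
        # Residual connection around the attention layer.
        mixed = [[x + y for x, y in zip(tok, att)] for tok, att in zip(mixed, attended)]
    # Read back only the prediction tokens; each one yields one box value.
    outputs = mixed[:n_pred]
    box = [linear_sigmoid_head(tok, w) for tok, w in zip(outputs, box_weights)]
    # Assumption: the confidence score is computed from the mean prediction token.
    dim = len(outputs[0])
    pooled = [sum(tok[d] for tok in outputs) / n_pred for d in range(dim)]
    score = linear_sigmoid_head(pooled, score_weights)
    return box, score


def random_tokens(rng, count, dim):
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
DIM = 8
template = random_tokens(rng, 4, DIM)      # toy template tokens
search = random_tokens(rng, 16, DIM)       # toy search-area tokens
pred_tokens = random_tokens(rng, 4, DIM)   # four prediction tokens
box_weights = random_tokens(rng, 4, DIM)   # one head weight vector per token
score_weights = random_tokens(rng, 1, DIM)[0]

box, score = track_step(template, search, pred_tokens, box_weights, score_weights)
print(box, score)
```

Because the box and score are read from only four tokens, the prediction stage needs no dense per-location head over the search region, which is the efficiency argument the abstract makes.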

updated: Wed Feb 07 2024 12:20:21 GMT+0000 (UTC)

published: Thu May 25 2023 09:50:54 GMT+0000 (UTC)

arXiv
