Learning Target-aware Representation for Visual Tracking via Informative Interactions

Mingzhe Guo; Zhipeng Zhang; Heng Fan; Liping Jing; Yilin Lyu; Bing Li; Weiming Hu

We introduce a novel backbone architecture to improve the target-perception ability of feature representations for tracking. Specifically, we observe that de facto frameworks perform feature matching using only the backbone outputs for target localization, so there is no direct feedback from the matching module to the backbone network, especially its shallow layers. More concretely, only the matching module can directly access the target information (in the reference frame), while the representation learning of the candidate frame is blind to the reference target. As a consequence, target-irrelevant interference accumulated in the shallow stages may degrade the feature quality of deeper layers. In this paper, we approach the problem from a different angle by conducting multiple branch-wise interactions inside the Siamese-like backbone networks (InBN). At the core of InBN is a general interaction modeler (GIM) that injects the prior knowledge of the reference image into different stages of the backbone network, leading to better target perception and more robust distractor resistance in the candidate feature representation at negligible computation cost. The proposed GIM module and InBN mechanism are general and applicable to different backbone types, including CNN and Transformer, as evidenced by our extensive experiments on multiple benchmarks. In particular, the CNN version (based on SiamCAR) improves the baseline by 3.2/6.9 absolute SUC points on LaSOT/TNL2K, respectively. The Transformer version obtains SUC scores of 65.7/52.0 on LaSOT/TNL2K, which are on par with recent state-of-the-art trackers. Code and models will be released.
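The idea of interacting inside the backbone, rather than only after it, can be illustrated with a small sketch. The abstract does not specify how GIM is built, so the code below is not the authors' implementation: it assumes, purely for illustration, that the interaction is a residual cross-attention from candidate tokens to reference tokens, and that each backbone stage is a shared linear map followed by ReLU. The names `interact`, `stage`, and `siamese_backbone` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def interact(candidate, reference):
    """Hypothetical stand-in for GIM (an assumption, not the paper's design).

    Candidate tokens act as queries and reference tokens as keys and values.
    The attended reference features are added back to the candidate tokens.
    """
    dim = candidate.shape[-1]
    attn = softmax(candidate @ reference.T / np.sqrt(dim))  # (N_cand, N_ref)
    return candidate + attn @ reference


def stage(tokens, weight):
    """Toy backbone stage: linear map plus ReLU, shared by both branches."""
    return np.maximum(tokens @ weight, 0.0)


def siamese_backbone(candidate, reference, weights, use_interaction=True):
    """Run both branches stage by stage, optionally injecting reference
    information into the candidate branch after every stage."""
    for weight in weights:
        candidate = stage(candidate, weight)
        reference = stage(reference, weight)
        if use_interaction:
            candidate = interact(candidate, reference)
    return candidate, reference


rng = np.random.default_rng(0)
dim = 8
weights = [rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(3)]
reference = rng.standard_normal((16, dim))  # e.g. 4x4 reference-frame tokens
candidate = rng.standard_normal((64, dim))  # e.g. 8x8 candidate-frame tokens

cand_feat, ref_feat = siamese_backbone(candidate, reference, weights)
print(cand_feat.shape, ref_feat.shape)
```

In this sketch only the candidate branch is modified, so the reference features are the same with or without interaction. Setting `use_interaction=False` recovers the conventional pipeline the abstract criticizes, where the two branches meet only in the downstream matching module.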

updated: Fri Jan 07 2022 16:22:27 GMT+0000 (UTC)

published: Fri Jan 07 2022 16:22:27 GMT+0000 (UTC)

arXiv
