Siamese Attentional Keypoint Network for High Performance Visual Tracking

Peng Gao; Ruyue Yuan; Fei Wang; Liyi Xiao; Hamido Fujita; Yan Zhang

In this paper, we investigate the impact of three main aspects of visual tracking, i.e., the backbone network, the attentional mechanism, and the detection component, and propose a Siamese Attentional Keypoint Network, dubbed SATIN, for efficient tracking and accurate localization. First, a new Siamese lightweight hourglass network is designed specifically for visual tracking. It exploits repeated bottom-up and top-down inference to capture more global and local contextual information at multiple scales. Second, a novel cross-attentional module leverages both channel-wise and spatial intermediate attentional information, which enhances both the discriminative and the localization capabilities of the feature maps. Third, a keypoint detection approach is devised to track any target object by detecting the top-left corner point, the centroid point, and the bottom-right corner point of its bounding box. As a result, our SATIN tracker not only has a strong capability to learn more effective object representations, but is also efficient in computation and memory storage during both the training and testing stages. To the best of our knowledge, we are the first to propose this approach. Without bells and whistles, experimental results demonstrate that our approach achieves state-of-the-art performance on several recent benchmark datasets, at a speed far exceeding 27 frames per second.
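To make the three-keypoint idea concrete, the following is a minimal sketch of how a bounding box could be read off from three score maps, one each for the top-left corner, the centroid, and the bottom-right corner. It is an illustration under stated assumptions, not the SATIN implementation: the function names `peak` and `decode_box`, the use of a plain argmax, and the tolerance-based consistency check between the centroid and the corner midpoint are all hypothetical and do not come from the abstract.

```python
import numpy as np

def peak(heatmap):
    """Return (x, y) of the highest-scoring cell in a 2-D heatmap."""
    row, col = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return int(col), int(row)

def decode_box(tl_map, center_map, br_map, tol=2.0):
    """Build a box from the two corner peaks and compare it with the centroid peak.

    Assumption (not from the paper): the box is accepted as consistent when
    the midpoint of the two corners lies within `tol` pixels of the centroid.
    """
    x1, y1 = peak(tl_map)       # top-left corner
    x2, y2 = peak(br_map)       # bottom-right corner
    cx, cy = peak(center_map)   # centroid
    mid_x, mid_y = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    consistent = abs(mid_x - cx) <= tol and abs(mid_y - cy) <= tol
    return (x1, y1, x2, y2), consistent

# Toy 16x16 heatmaps with a single peak each (indexing is [row, col]).
tl = np.zeros((16, 16)); tl[3, 4] = 1.0
br = np.zeros((16, 16)); br[11, 12] = 1.0
ct = np.zeros((16, 16)); ct[7, 8] = 1.0

box, ok = decode_box(tl, ct, br)
print(box, ok)
```

In this toy case the centroid peak coincides with the midpoint of the two corners, so the box is flagged as consistent. A real tracker would need sub-pixel offsets, multiple candidates, and learned scores, none of which are modelled here.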

updated: Sun Dec 29 2019 03:03:41 GMT+0000 (UTC)

published: Tue Apr 23 2019 03:02:34 GMT+0000 (UTC)

arXiv
