RGBT Tracking via Progressive Fusion Transformer with Dynamically Guided Learning

Yabin Zhu; Chenglong Li; Xiao Wang; Jin Tang; Zhixiang Huang

Existing Transformer-based RGBT tracking methods either use cross-attention to fuse the two modalities, or use self-attention and cross-attention to model both modality-specific and modality-shared information. However, the significant appearance gap between the modalities limits the feature representation ability of certain modalities during fusion. To address this problem, we propose a novel Progressive Fusion Transformer, called ProFormer, which progressively integrates single-modality information into the multimodal representation for robust RGBT tracking. In particular, ProFormer first uses a self-attention module to collaboratively extract the multimodal representation, and then uses two cross-attention modules to let it interact with the features of each of the two modalities. In this way, modality-specific information is well activated in the multimodal representation. Finally, a feed-forward network fuses the two resulting multimodal representations to further enhance the final multimodal representation. In addition, existing learning methods for RGBT trackers either fuse multimodal features into one for final classification, or exploit the relationship between the unimodal branches and the fused branch through a competitive learning strategy. However, they either ignore the learning of the single-modality branches or leave one branch poorly optimized. To solve these problems, we propose a dynamically guided learning algorithm that adaptively uses well-performing branches to guide the learning of the other branches, enhancing the representation ability of each branch. Extensive experiments demonstrate that the proposed ProFormer sets a new state of the art on the RGBT210, RGBT234, LasHeR, and VTUAV datasets.
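
The three fusion stages described in the abstract (joint self-attention, two cross-attention passes, feed-forward fusion) can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify tensor layouts, so the token-wise concatenation of RGB and thermal tokens, the single-head attention without learned projections or layer normalization, the channel-wise pairing before the feed-forward network, and all names (`attend`, `progressive_fusion`, `w1`, `w2`) are assumptions made purely for illustration.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, context):
    """Scaled dot-product attention plus a residual connection.
    Assumption: single head, no learned Q/K/V projections, no layer norm."""
    scale = math.sqrt(len(query[0]))
    context_t = [list(col) for col in zip(*context)]
    scores = matmul(query, context_t)
    weights = [softmax([s / scale for s in row]) for row in scores]
    attended = matmul(weights, context)
    return [[q + a for q, a in zip(q_row, a_row)]
            for q_row, a_row in zip(query, attended)]

def feed_forward(tokens, w1, w2):
    """Two-layer feed-forward network with a ReLU in between (no biases)."""
    hidden = [[max(0.0, h) for h in row] for row in matmul(tokens, w1)]
    return matmul(hidden, w2)

def progressive_fusion(rgb, tir, w1, w2):
    # Stage 1: self-attention over both modalities' tokens together.
    # Assumption: the joint input is a token-wise concatenation.
    joint = rgb + tir
    multimodal = attend(joint, joint)
    # Stage 2: the multimodal tokens query each single modality in turn.
    with_rgb = attend(multimodal, rgb)
    with_tir = attend(multimodal, tir)
    # Stage 3: pair the two results channel-wise and fuse them with an FFN.
    paired = [a + b for a, b in zip(with_rgb, with_tir)]
    return feed_forward(paired, w1, w2)

def random_matrix(rows, cols, rng):
    """Small random matrix standing in for features or learned weights."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
num_tokens, dim, hidden_dim = 4, 8, 16
rgb = random_matrix(num_tokens, dim, rng)   # stand-in RGB token features
tir = random_matrix(num_tokens, dim, rng)   # stand-in thermal token features
w1 = random_matrix(2 * dim, hidden_dim, rng)
w2 = random_matrix(hidden_dim, dim, rng)
fused = progressive_fusion(rgb, tir, w1, w2)
print(len(fused), len(fused[0]))
```

In this sketch the fused output keeps one token per input token from either modality and returns to the original channel width. A real tracker would use learned projections, multiple heads, and normalization, none of which the abstract describes.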

updated: Sun Mar 26 2023 16:55:58 GMT+0000 (UTC)

published: Sun Mar 26 2023 16:55:58 GMT+0000 (UTC)

arXiv
