TS-CAM: Token Semantic Coupled Attention Map for Weakly Supervised Object Localization

Wei Gao; Fang Wan; Xingjia Pan; Zhiliang Peng; Qi Tian; Zhenjun Han; Bolei Zhou; Qixiang Ye

Weakly supervised object localization (WSOL) is the challenging problem of learning object localization models when only image category labels are given. Optimizing a convolutional neural network (CNN) for classification tends to activate local discriminative regions while ignoring the complete object extent, causing the partial activation issue. In this paper, we argue that partial activation is caused by an intrinsic characteristic of CNNs: convolution operations produce local receptive fields and have difficulty capturing long-range feature dependencies among pixels. We introduce the token semantic coupled attention map (TS-CAM) to take full advantage of the self-attention mechanism in the visual transformer for long-range dependency extraction. TS-CAM first splits an image into a sequence of patch tokens for spatial embedding, which produces attention maps of long-range visual dependency to avoid partial activation. TS-CAM then re-allocates category-related semantics to the patch tokens, enabling each of them to be aware of object categories. Finally, TS-CAM couples the patch tokens with the semantic-agnostic attention map to achieve semantic-aware localization. Experiments on the ILSVRC/CUB-200-2011 datasets show that TS-CAM outperforms its CNN-CAM counterparts by 7.1%/27.1% for WSOL, achieving state-of-the-art performance.
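
The final coupling step can be illustrated with a small sketch. The abstract does not say how the coupling is computed, so this example assumes an element-wise product over the patch grid, followed by min-max normalization and a fixed threshold to extract a box. The function names (`couple`, `min_max_normalize`, `bounding_box`), the 3x3 grid, and all values are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
from typing import List, Tuple

Grid = List[List[float]]


def min_max_normalize(grid: Grid) -> Grid:
    """Rescale a 2-D grid to [0, 1]; a constant grid maps to all zeros."""
    flat = [v for row in grid for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0 for _ in row] for row in grid]
    return [[(v - lo) / (hi - lo) for v in row] for row in grid]


def couple(semantic_map: Grid, attention_map: Grid) -> Grid:
    """Combine a class-specific semantic map with a class-agnostic attention map.

    Assumption: the coupling is an element-wise product over the patch grid.
    """
    return [
        [s * a for s, a in zip(s_row, a_row)]
        for s_row, a_row in zip(semantic_map, attention_map)
    ]


def bounding_box(grid: Grid, threshold: float) -> Tuple[int, int, int, int]:
    """Return (row_min, col_min, row_max, col_max) over cells >= threshold."""
    cells = [
        (r, c)
        for r, row in enumerate(grid)
        for c, v in enumerate(row)
        if v >= threshold
    ]
    if not cells:
        raise ValueError("no cell reaches the threshold")
    rows = [r for r, _ in cells]
    cols = [c for _, c in cells]
    return min(rows), min(cols), max(rows), max(cols)


# Hypothetical 3x3 patch grid: per-patch score for one category ...
semantic = [[0.1, 0.2, 0.1],
            [0.2, 0.9, 0.8],
            [0.1, 0.7, 0.9]]
# ... and a category-agnostic attention map over the same patches.
attention = [[0.9, 0.1, 0.1],
             [0.1, 0.8, 0.9],
             [0.1, 0.9, 0.8]]

coupled = min_max_normalize(couple(semantic, attention))
box = bounding_box(coupled, threshold=0.5)
print(box)
```

In this toy grid the top-left patch has high attention but a low category score, so the product suppresses it and the box covers only the lower-right 2x2 block of patches, where both maps agree.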

updated: Mon Jun 21 2021 09:45:10 GMT+0000 (UTC)

published: Sat Mar 27 2021 09:43:16 GMT+0000 (UTC)

arXiv
