Robustifying Token Attention for Vision Transformers

Yong Guo; David Stutz; Bernt Schiele

Despite the success of vision transformers (ViTs), they still suffer from significant drops in accuracy in the presence of common corruptions, such as noise or blur. Interestingly, we observe that the attention mechanism of ViTs tends to rely on a few important tokens, a phenomenon we call token overfocusing. More critically, these tokens are not robust to corruptions, often leading to highly diverging attention patterns. In this paper, we aim to alleviate this overfocusing issue and make attention more stable through two general techniques. First, our Token-aware Average Pooling (TAP) module encourages the local neighborhood of each token to take part in the attention mechanism. Specifically, TAP learns average pooling schemes for each token such that the information of potentially important tokens in the neighborhood can be taken into account adaptively. Second, we use our Attention Diversification Loss (ADL) to force the output tokens to aggregate information from a diverse set of input tokens rather than focusing on just a few. We achieve this by penalizing high cosine similarity between the attention vectors of different tokens. In experiments, we apply our methods to a wide range of transformer architectures and improve robustness significantly. For example, building on the state-of-the-art robust architecture FAN, we improve corruption robustness on ImageNet-C by 2.4% while simultaneously improving accuracy by 0.4%. Also, when finetuning on semantic segmentation tasks, we improve robustness on CityScapes-C by 2.4% and on ACDC by 3.1%.
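
To make the idea behind ADL concrete, the following is a minimal sketch of a penalty on cosine similarity between attention vectors. It is not the authors' implementation: the abstract does not give the exact formulation, so averaging over all pairs of rows is an assumption, and the names `cosine_similarity` and `attention_diversity_penalty` are hypothetical. Each row of `attn` is taken to be the attention vector of one output token over the input tokens.

```python
import math


def cosine_similarity(u, v, eps=1e-8):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # eps guards against division by zero for all-zero vectors.
    return dot / max(norm_u * norm_v, eps)


def attention_diversity_penalty(attn):
    """Mean pairwise cosine similarity between rows of an attention map.

    Assumption (not from the paper): the penalty is the plain average
    over all unordered pairs of distinct rows. A high value means many
    output tokens attend to the same input tokens.
    """
    n = len(attn)
    if n < 2:
        return 0.0
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += cosine_similarity(attn[i], attn[j])
    return total / (n * (n - 1) / 2)


# Overfocused map: every output token attends only to input token 0.
focused = [[1.0, 0.0, 0.0, 0.0] for _ in range(4)]
# Diverse map: every output token attends to a different input token.
diverse = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

print(attention_diversity_penalty(focused))
print(attention_diversity_penalty(diverse))
```

In this toy example the overfocused map receives the maximum penalty and the diverse map receives none. In actual training, a term of this kind would presumably be computed with differentiable tensor operations and added to the task loss with a weighting factor, which this sketch leaves out.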

updated: Tue Apr 04 2023 03:28:27 GMT+0000 (UTC)

published: Mon Mar 20 2023 14:04:40 GMT+0000 (UTC)

arXiv
