Salient Positions based Attention Network for Image Classification

Sheng Fang; Kaiyu Li; Zhe Li

画像分類のための顕著な位置に基づく注意ネットワーク

自己注意メカニズムは、長い依存関係をモデル化するという最も重要な利点と、コンピュータービジョンタスクのバリエーションとして広く知られています。非ローカルブロックは、入力特徴マップのグローバル依存関係をモデル化しようとします。グローバルなコンテキスト情報を収集するには、必然的に膨大な量のメモリとコンピューティングリソースが必要になります。これは、過去数年間で広範に研究されてきました。ただし、自己注意スキームにはさらに問題があります。グローバルスコープから収集されたすべての情報は、コンテキストモデリングに役立ちますか?私たちの知る限り、この問題に焦点を当てた研究はほとんどありません。両方の質問を目的として、この論文は、自己注意スキームで生成されたアテンションマップと親和性マトリックスに関するいくつかの興味深い観察に触発された、顕著な位置に基づくアテンションスキーム SPANet を提案します。これらの観察は、自己注意をよりよく理解するために有益であると信じています。 SPANet は、顕著な位置選択アルゴリズムを使用して、アテンションマップの計算に関与する限られた量の顕著な点のみを選択します。このアプローチは、多くのメモリとコンピューティングリソースを節約するだけでなく、入力特徴マップの変換から肯定的な情報を抽出しようとします。実装では、一般的な視覚画像とはまったく異なるチャネル高次元の特徴マップを考慮して、チャネル次元に沿った特徴マップのべき乗を位置の顕著性メトリックとして採用します。一般に、非ローカルブロック法とは異なり、SPANet は、空間次元ではなくチャネル次元に沿って、すべてではなく選択された位置のみを使用してコンテキスト情報をモデル化します。ソースコードは https://github.com/likyoo/SPANet で入手できます。

The self-attention mechanism has attracted wide publicity for its most important advantage of modeling long dependency, and its variations in computer vision tasks, the non-local block tries to model the global dependency of the input feature maps. Gathering global contextual information will inevitably need a tremendous amount of memory and computing resources, which has been extensively studied in the past several years. However, there is a further problem with the self-attention scheme: is all information gathered from the global scope helpful for the contextual modelling? To our knowledge, few studies have focused on the problem. Aimed at both questions this paper proposes the salient positions-based attention scheme SPANet, which is inspired by some interesting observations on the attention maps and affinity matrices generated in self-attention scheme. We believe these observations are beneficial for better understanding of the self-attention. SPANet uses the salient positions selection algorithm to select only a limited amount of salient points to attend in the attention map computing. This approach will not only spare a lot of memory and computing resources, but also try to distill the positive information from the transformation of the input feature maps. In the implementation, considering the feature maps with channel high dimensions, which are completely different from the general visual image, we take the squared power of the feature maps along the channel dimension as the saliency metric of the positions. In general, different from the non-local block method, SPANet models the contextual information using only the selected positions instead of all, along the channel dimension instead of space dimension. Our source code is available at https://github.com/likyoo/SPANet.

updated: Wed Jun 09 2021 11:32:29 GMT+0000 (UTC)

published: Wed Jun 09 2021 11:32:29 GMT+0000 (UTC)

arXiv

参考文献 (このサイトで利用可能なもの) / References (only if available on this site)

被参照文献 (このサイトで利用可能なものを新しい順に) / Citations (only if available on this site, in order of most recent)

Amazon.co.jpアソシエイト