Explicitly Modeled Attention Maps for Image Classification

Andong Tan; Duc Tam Nguyen; Maximilian Dax; Matthias Nießner; Thomas Brox

Self-attention networks have shown remarkable progress in computer vision tasks such as image classification. The main benefit of the self-attention mechanism is its ability to capture long-range feature interactions in attention maps. However, computing attention maps requires learnable keys, queries, and positional encodings, which are often unintuitive to use and computationally expensive. To mitigate this problem, we propose a novel self-attention module with explicitly modeled attention maps that uses only a single learnable parameter, keeping computational overhead low. The design of explicitly modeled attention maps using a geometric prior is based on the observation that the spatial context of a given pixel within an image is mostly dominated by its neighbors, while more distant pixels make only a minor contribution. Concretely, the attention maps are parametrized via simple functions (e.g., a Gaussian kernel) with a learnable radius, which is modeled independently of the input content. Our evaluation shows that our method achieves an accuracy improvement of up to 2.2% over the ResNet baselines on ImageNet ILSVRC and outperforms other self-attention methods such as AA-ResNet152 in accuracy by 0.9%, with 6.4% fewer parameters and 6.7% fewer GFLOPs. This result empirically indicates the value of incorporating a geometric prior into the self-attention mechanism when applied to image classification.
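
As a rough illustration of the idea in the abstract, the sketch below builds a content-independent attention map from a Gaussian kernel over pixel distances, controlled by a single radius. It is a minimal pure-Python sketch under stated assumptions, not the authors' implementation: the function names `gaussian_attention_map` and `apply_attention`, the row normalization, and the treatment of the radius as a plain float (rather than a parameter trained by gradient descent inside a ResNet block) are all hypothetical choices made for this example.

```python
import math


def gaussian_attention_map(height, width, radius):
    """Build an (H*W) x (H*W) attention matrix from pixel geometry only.

    Entry [q][k] depends solely on the spatial distance between query
    pixel q and key pixel k, never on feature values. `radius` stands in
    for the single learnable parameter; normalizing each row to sum to 1
    is an assumption of this sketch.
    """
    coords = [(y, x) for y in range(height) for x in range(width)]
    attn = []
    for qy, qx in coords:
        weights = [
            math.exp(-((qy - ky) ** 2 + (qx - kx) ** 2) / (2.0 * radius ** 2))
            for ky, kx in coords
        ]
        total = sum(weights)
        attn.append([w / total for w in weights])
    return attn


def apply_attention(attn, values):
    """Aggregate one scalar value per pixel using the attention matrix."""
    return [sum(w * v for w, v in zip(row, values)) for row in attn]


if __name__ == "__main__":
    # A 3x3 feature map with a single bright pixel in the center.
    features = [0.0, 0.0, 0.0,
                0.0, 1.0, 0.0,
                0.0, 0.0, 0.0]
    for radius in (0.5, 1.0, 5.0):
        attn = gaussian_attention_map(3, 3, radius)
        smoothed = apply_attention(attn, features)
        print(radius, [round(v, 3) for v in smoothed])
```

A small radius concentrates each pixel's attention on itself and its immediate neighbors, while a large radius approaches uniform averaging over the whole image. Because the map contains no keys, queries, or positional encodings, it can be computed once per radius and reused for every input.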

updated: Thu Mar 18 2021 14:18:57 GMT+0000 (UTC)

published: Sun Jun 14 2020 11:47:09 GMT+0000 (UTC)

arXiv
