Locally Enhanced Self-Attention: Rethinking Self-Attention as Local and Context Terms

Chenglin Yang; Siyuan Qiao; Adam Kortylewski; Alan Yuille

Self-attention has become prevalent in computer vision models. Inspired by fully connected Conditional Random Fields (CRFs), we decompose it into local and context terms. These correspond to the unary and binary terms in a CRF and are implemented by attention mechanisms with projection matrices. We observe that the unary terms make only small contributions to the outputs, whereas standard CNNs, which rely solely on the unary terms, achieve strong performance on a variety of tasks. Therefore, we propose Locally Enhanced Self-Attention (LESA), which enhances the unary term by incorporating convolutions into it and uses a fusion module to dynamically couple the unary and binary operations. In our experiments, we replace the self-attention modules with LESA. The results on ImageNet and COCO show that LESA outperforms convolution and self-attention baselines on image recognition, object detection, and instance segmentation. The code is publicly available.
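
The decomposition described in the abstract can be illustrated with a small sketch. This is one plausible reading, not the authors' implementation: it assumes the local (unary) term is the contribution of a token attending to itself and the context (binary) term is the contribution from all other tokens, both under the same softmax. The function names (`attention_terms`, `full_attention`) and the toy inputs are hypothetical, and the convolution-enhanced unary term and the fusion module of LESA are not shown.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_terms(queries, keys, values):
    """Split each attention output into a local (j == i) and a context (j != i) part.

    Assumption: both parts share one softmax over all positions j.
    """
    dim = len(values[0])
    local, context = [], []
    for i, q in enumerate(queries):
        weights = softmax([dot(q, k) for k in keys])
        # Local (unary) term: the token's own value, weighted by self-attention.
        local.append([weights[i] * v for v in values[i]])
        # Context (binary) term: weighted sum over every other token.
        ctx = [0.0] * dim
        for j, w in enumerate(weights):
            if j != i:
                for d in range(dim):
                    ctx[d] += w * values[j][d]
        context.append(ctx)
    return local, context


def full_attention(queries, keys, values):
    """Plain softmax attention, for comparison with the two-term split."""
    dim = len(values[0])
    out = []
    for q in queries:
        weights = softmax([dot(q, k) for k in keys])
        out.append([sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)])
    return out


# Toy inputs (hypothetical): three tokens with two-dimensional features.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[0.5, 0.2], [0.1, 0.9], [0.3, 0.3]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

local, context = attention_terms(queries, keys, values)
full = full_attention(queries, keys, values)
```

Under these assumptions, adding `local[i]` and `context[i]` element by element recovers `full[i]`, which makes explicit that standard self-attention already contains both terms and only weights them through a shared softmax.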

updated: Mon Jul 12 2021 18:00:00 GMT+0000 (UTC)

published: Mon Jul 12 2021 18:00:00 GMT+0000 (UTC)

arXiv
