On the Integration of Self-Attention and Convolution

Xuran Pan; Chunjiang Ge; Rui Lu; Shiji Song; Guanfu Chen; Zeyi Huang; Gao Huang

Convolution and self-attention are two powerful techniques for representation learning, and they are usually considered two peer approaches that are distinct from each other. In this paper, we show that there exists a strong underlying relation between them, in the sense that the bulk of the computation in these two paradigms is in fact done with the same operation. Specifically, we first show that a traditional convolution with kernel size k x k can be decomposed into k^2 individual 1x1 convolutions, followed by shift and summation operations. Then, we interpret the projections of queries, keys, and values in the self-attention module as multiple 1x1 convolutions, followed by the computation of attention weights and aggregation of the values. Therefore, the first stage of both modules comprises a similar operation. More importantly, the first stage accounts for the dominant computational complexity (quadratic in the channel size) compared to the second stage. This observation naturally leads to an elegant integration of these two seemingly distinct paradigms, i.e., a mixed model that enjoys the benefits of both self-Attention and Convolution (ACmix), while having minimal computational overhead compared to its pure convolution or self-attention counterpart. Extensive experiments show that our model achieves consistently improved results over competitive baselines on image recognition and downstream tasks. Code and pre-trained models will be released at https://github.com/Panxuran/ACmix and https://gitee.com/mindspore/models.
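
The decomposition described in the abstract can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation: it assumes stride 1, an odd kernel size, and zero "same" padding, and the function names (`conv2d_direct`, `shift`, `conv2d_shift_sum`) are hypothetical names chosen for illustration. It compares a direct k x k convolution with the sum of k^2 shifted 1x1 convolutions.

```python
import numpy as np

def conv2d_direct(x, w):
    """Reference k x k convolution (cross-correlation form, as in most
    deep-learning libraries), stride 1, zero 'same' padding.
    x: (C_in, H, W); w: (C_out, C_in, k, k) with k odd."""
    c_out, _, k, _ = w.shape
    _, h, wd = x.shape
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    out = np.zeros((c_out, h, wd))
    for i in range(h):
        for j in range(wd):
            patch = xp[:, i:i + k, j:j + k]  # (C_in, k, k)
            out[:, i, j] = np.tensordot(w, patch, axes=([1, 2, 3], [0, 1, 2]))
    return out

def shift(f, dy, dx):
    """Return g with g[:, i, j] = f[:, i + dy, j + dx], zero where out of range."""
    _, h, wd = f.shape
    out = np.zeros_like(f)
    i0 = max(0, -dy)
    i1 = max(i0, min(h, h - dy))
    j0 = max(0, -dx)
    j1 = max(j0, min(wd, wd - dx))
    out[:, i0:i1, j0:j1] = f[:, i0 + dy:i1 + dy, j0 + dx:j1 + dx]
    return out

def conv2d_shift_sum(x, w):
    """Same convolution, computed as k^2 separate 1x1 convolutions
    (stage 1) followed by shift and summation (stage 2)."""
    c_out, _, k, _ = w.shape
    p = k // 2
    out = np.zeros((c_out,) + x.shape[1:])
    for a in range(k):
        for b in range(k):
            # Stage 1: a 1x1 convolution is a per-pixel matrix multiply
            # with the (C_out, C_in) slice of the kernel at position (a, b).
            proj = np.einsum('oc,chw->ohw', w[:, :, a, b], x)
            # Stage 2: shift by the kernel offset and accumulate.
            out += shift(proj, a - p, b - p)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 5, 6))     # C_in=3, H=5, W=6
w = rng.standard_normal((4, 3, 3, 3))  # C_out=4, k=3
print(np.allclose(conv2d_direct(x, w), conv2d_shift_sum(x, w)))
```

In this sketch, all multiplications happen in stage 1 (k^2 channel-mixing products per pixel, each of size C_out x C_in), while stage 2 only moves and adds feature maps. This is consistent with the abstract's claim that the first stage dominates the cost.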

updated: Mon Nov 29 2021 14:37:05 GMT+0000 (UTC)

published: Mon Nov 29 2021 14:37:05 GMT+0000 (UTC)

arXiv
