X-volution: On the unification of convolution and self-attention

Xuanhong Chen; Hang Wang; Bingbing Ni

Convolution and self-attention serve as two fundamental building blocks in deep neural networks: the former extracts local image features in a linear way, while the latter non-locally encodes high-order contextual relationships. Although the two are essentially complementary (first-order versus high-order), state-of-the-art architectures, i.e., CNNs and transformers, lack a principled way to apply both operations simultaneously in a single computational module, owing to their heterogeneous computing patterns and the excessive cost of global dot-products in visual tasks. In this work, we theoretically derive a global self-attention approximation scheme, which approximates self-attention via a convolution operation on transformed features. Based on this approximation scheme, we establish a multi-branch elementary module composed of both convolution and self-attention operations, capable of unifying local and non-local feature interaction. Importantly, once trained, this multi-branch module can be conditionally converted into a single standard convolution operation via structural re-parameterization, yielding a pure convolution-style operator named X-volution that can be plugged into any modern network as an atomic operation. Extensive experiments demonstrate that the proposed X-volution achieves highly competitive improvements in visual understanding (+1.2% top-1 accuracy on ImageNet classification, +1.7 box AP and +1.5 mask AP on COCO detection and segmentation).
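The structural re-parameterization mentioned in the abstract relies on the linearity of convolution: parallel convolution branches whose outputs are summed can be folded into one kernel. The following is a minimal sketch of that general idea only, not the authors' implementation. It assumes a single channel, zero padding, and two static branches (a 3x3 and a 1x1 kernel); the function names `conv2d_same` and `merge_branches` are hypothetical, and the approximated self-attention branch of X-volution is not modelled here.

```python
import random


def conv2d_same(image, kernel):
    """Naive single-channel cross-correlation with zero padding ("same" output size).

    image: list of rows of floats; kernel: square list of rows with odd side length.
    """
    k = len(kernel)
    pad = k // 2
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    yy, xx = y + i - pad, x + j - pad
                    # Positions outside the image contribute zero (zero padding).
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kernel[i][j] * image[yy][xx]
            out[y][x] = acc
    return out


def merge_branches(k3, k1):
    """Fold a 1x1 branch into a 3x3 kernel by adding its weight at the centre tap."""
    merged = [row[:] for row in k3]
    merged[1][1] += k1[0][0]
    return merged


random.seed(0)
image = [[random.uniform(-1, 1) for _ in range(7)] for _ in range(6)]
k3 = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(3)]
k1 = [[random.uniform(-1, 1)]]

# Training-time form: two branches evaluated separately, outputs summed.
branch_a = conv2d_same(image, k3)
branch_b = conv2d_same(image, k1)
two_branch = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(branch_a, branch_b)]

# Inference-time form: one convolution with the merged kernel.
single = conv2d_same(image, merge_branches(k3, k1))

max_abs_diff = max(
    abs(a - b) for ra, rb in zip(two_branch, single) for a, b in zip(ra, rb)
)
```

Here `max_abs_diff` should be at the level of floating-point rounding, which illustrates why a summed multi-branch module of convolutions can be replaced by a single convolution after training without changing its output.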

updated: Mon Jun 07 2021 09:03:46 GMT+0000 (UTC)

published: Fri Jun 04 2021 04:32:02 GMT+0000 (UTC)

arXiv

