Adaptive Split-Fusion Transformer

Zixuan Su; Hao Zhang; Jingjing Chen; Lei Pang; Chong-Wah Ngo; Yu-Gang Jiang

Neural networks for visual content understanding have recently evolved from convolutional neural networks (CNNs) to transformers. The former rely on small-windowed kernels to capture regional cues, giving them strong local expressiveness. In contrast, the latter establish long-range global connections between localities for holistic learning. This complementary nature has inspired growing interest in hybrid models that make the best use of each technique. Current hybrids merely use convolutions as simple substitutes for linear projections, or juxtapose a convolution branch with attention, without considering the relative importance of local and global modeling. To tackle this, we propose a new hybrid named Adaptive Split-Fusion Transformer (ASF-former), which treats the convolutional and attention branches differently using adaptive weights. Specifically, an ASF-former encoder splits the feature channels evenly into two halves to feed the dual-path inputs. The outputs of the two paths are then fused using weighting scalars calculated from visual cues. We also design the convolutional path to be compact for efficiency. Extensive experiments on standard benchmarks, such as ImageNet-1K, CIFAR-10, and CIFAR-100, show that ASF-former outperforms its CNN and transformer counterparts, as well as prior hybrids, in terms of accuracy (83.9% on ImageNet-1K) under similar conditions (12.9G MACs / 56.7M parameters, without large-scale pre-training). The code is available at https://github.com/szx503045266/ASF-former.
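
The split-then-fuse idea described in the abstract can be illustrated with a small, dependency-free sketch. This is not the authors' implementation (see the repository linked above for that); it only mirrors the three steps named in the abstract. Everything else is an assumption made for illustration: the function names, which half of the channels goes to which path, the 3-tap neighbour average standing in for the compact convolutional path, the unparameterised softmax self-attention standing in for the attention path, the sigmoid-of-mean gate standing in for the learned weighting scalar, and the complementary `alpha` / `1 - alpha` fusion.

```python
import math


def split_channels(tokens):
    """Split each token's channel vector into two equal halves."""
    half = len(tokens[0]) // 2
    return [t[:half] for t in tokens], [t[half:] for t in tokens]


def local_path(tokens):
    """Stand-in for the convolutional path: average over neighbouring tokens."""
    n = len(tokens)
    out = []
    for i in range(n):
        window = tokens[max(0, i - 1):min(n, i + 2)]
        out.append([sum(col) / len(window) for col in zip(*window)])
    return out


def global_path(tokens):
    """Stand-in for the attention path: softmax self-attention, no learned weights."""
    scale = math.sqrt(len(tokens[0]))
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in tokens]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[c] for w, v in zip(weights, tokens)) / total
                    for c in range(len(q))])
    return out


def gate(tokens):
    """Hypothetical gate: sigmoid of the mean activation, a scalar in (0, 1)."""
    flat = [x for t in tokens for x in t]
    return 1.0 / (1.0 + math.exp(-sum(flat) / len(flat)))


def split_fusion(tokens):
    """Split channels, run both paths, and fuse them with one adaptive scalar."""
    conv_in, attn_in = split_channels(tokens)  # assumed assignment of halves
    alpha = gate(tokens)                       # computed from the input itself
    local, glob = local_path(conv_in), global_path(attn_in)
    fused = [[alpha * l + (1.0 - alpha) * g for l, g in zip(lt, gt)]
             for lt, gt in zip(local, glob)]
    return fused, alpha


# Three tokens with four channels each; the fused output keeps half the channels.
tokens = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.1, -0.2, 0.0], [0.3, 0.3, 0.1, -0.1]]
fused, alpha = split_fusion(tokens)
print(len(fused), len(fused[0]), round(alpha, 3))
```

In this sketch a gate value near 1 favours the local (convolution-like) path and a value near 0 favours the global (attention-like) path, so the balance shifts with the input rather than being fixed in advance.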

updated: Wed Aug 16 2023 17:09:41 GMT+0000 (UTC)

published: Tue Apr 26 2022 10:00:28 GMT+0000 (UTC)

arXiv
