SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications

Abdelrahman Shaker; Muhammad Maaz; Hanoona Rasheed; Salman Khan; Ming-Hsuan Yang; Fahad Shahbaz Khan

Self-attention has become the de facto choice for capturing global context in various vision applications. However, its computational complexity is quadratic in image resolution, which limits its use in real-time applications, especially for deployment on resource-constrained mobile devices. Although hybrid approaches have been proposed to combine the advantages of convolutions and self-attention for a better speed-accuracy trade-off, the expensive matrix multiplication operations in self-attention remain a bottleneck. In this work, we introduce a novel efficient additive attention mechanism that effectively replaces the quadratic matrix multiplication operations with linear element-wise multiplications. Our design shows that the key-value interaction can be replaced with a linear layer without sacrificing any accuracy. Unlike previous state-of-the-art methods, our efficient formulation of self-attention enables its usage at all stages of the network. Using our proposed efficient additive attention, we build a series of models called "SwiftFormer" that achieve state-of-the-art performance in terms of both accuracy and mobile inference speed. Our small variant achieves 78.5% top-1 ImageNet-1K accuracy with only 0.8 ms latency on iPhone 14, making it more accurate than MobileViT-v2 and 2x faster. Code: https://github.com/Amshaker/SwiftFormer
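
The idea in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation (see the repository linked above for that), and several details are assumptions made for illustration: the parameter names w_q, w_k, w_a and w_t are hypothetical, the per-token scores are normalized with a softmax, and the queries are added back as a residual. What it shows is the shape of the computation: each token gets one scalar score, the queries are pooled into a single global query, that vector is multiplied element-wise with every key, and a linear layer stands in for the key-value interaction. No n-by-n attention matrix is formed.

```python
import math


def linear(x, weight):
    """Bias-free linear layer: x is n x d_in, weight is d_in x d_out."""
    columns = list(zip(*weight))
    return [[sum(a * b for a, b in zip(row, col)) for col in columns] for row in x]


def additive_attention(x, w_q, w_k, w_a, w_t):
    """Sketch of an additive attention step over n tokens of width d.

    w_q, w_k, w_t are d x d matrices and w_a is a length-d vector.
    All four names are hypothetical, chosen for this illustration.
    """
    n, d = len(x), len(w_a)
    q = linear(x, w_q)  # queries, n x d
    k = linear(x, w_k)  # keys, n x d

    # One scalar score per token instead of an n x n score matrix.
    scores = [sum(a * b for a, b in zip(row, w_a)) / math.sqrt(d) for row in q]

    # Softmax over tokens (assumed normalization for this sketch).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]

    # Pool all queries into a single global query vector of length d.
    g = [sum(alpha[i] * q[i][j] for i in range(n)) for j in range(d)]

    # Element-wise interaction between the global query and each key.
    mixed = [[g[j] * k[i][j] for j in range(d)] for i in range(n)]

    # A linear layer in place of the key-value interaction, plus the
    # queries added back as a residual (an assumption of this sketch).
    t = linear(mixed, w_t)
    return [[t[i][j] + q[i][j] for j in range(d)] for i in range(n)]


# Tiny usage example: 5 tokens of width 3 with identity projections.
d = 3
identity = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
tokens = [[0.1 * (i + 1) * (j + 1) for j in range(d)] for i in range(5)]
out = additive_attention(tokens, identity, identity, [0.5, -0.25, 1.0], identity)
```

Every step in this sketch visits each token a fixed number of times, so the cost grows linearly with the token count n. Standard self-attention, by contrast, compares every token with every other token.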

updated: Mon Mar 27 2023 17:59:58 GMT+0000 (UTC)

published: Mon Mar 27 2023 17:59:58 GMT+0000 (UTC)

arXiv
