EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention

Xinyu Liu; Houwen Peng; Ningxin Zheng; Yuqing Yang; Han Hu; Yixuan Yuan

Vision transformers have shown great success due to their high model capabilities. However, their remarkable performance is accompanied by heavy computation costs, which makes them unsuitable for real-time applications. In this paper, we propose a family of high-speed vision transformers named EfficientViT. We find that the speed of existing transformer models is commonly bounded by memory-inefficient operations, especially the tensor reshaping and element-wise functions in multi-head self-attention (MHSA). Therefore, we design a new building block with a sandwich layout, i.e., a single memory-bound MHSA layer placed between efficient feed-forward network (FFN) layers, which improves memory efficiency while enhancing channel communication. Moreover, we discover that attention maps are highly similar across heads, leading to computational redundancy. To address this, we present a cascaded group attention module that feeds each attention head a different split of the full feature, which not only saves computation cost but also improves attention diversity. Comprehensive experiments demonstrate that EfficientViT outperforms existing efficient models, striking a good trade-off between speed and accuracy. For instance, our EfficientViT-M5 surpasses MobileNetV3-Large by 1.9% in accuracy, while achieving 40.4% and 45.2% higher throughput on an NVIDIA V100 GPU and an Intel Xeon CPU, respectively. Compared to the recent efficient model MobileViT-XXS, EfficientViT-M2 achieves 1.8% higher accuracy, while running 5.8x/3.7x faster on the GPU/CPU, and 7.4x faster when converted to ONNX format. Code and models are available at https://github.com/microsoft/Cream/tree/main/EfficientViT.
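To make the idea of feeding attention heads with different splits of the full feature more concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation (see the linked repository for that). The function name `cascaded_group_attention`, the weight layout, and the use of plain dense projections are all hypothetical. The cascade step, where each head's output is added to the next head's input split, reflects one reading of "cascaded" and is not spelled out in the abstract.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cascaded_group_attention(x, wq, wk, wv, wo):
    """Illustrative sketch only (names and shapes are assumptions).

    x  : (tokens, channels) input features
    wq : list of per-head query weights, each (channels // heads, d_qk)
    wk : list of per-head key weights,   each (channels // heads, d_qk)
    wv : list of per-head value weights, each (channels // heads, channels // heads)
    wo : (channels, channels) output projection
    """
    num_heads = len(wq)
    # Each head receives a different channel split of the full feature.
    # channels must be divisible by num_heads.
    splits = np.split(x, num_heads, axis=1)
    carry = np.zeros_like(splits[0])  # nothing to cascade into the first head
    outputs = []
    for j in range(num_heads):
        # Assumed cascade: add the previous head's output to this head's split.
        h_in = splits[j] + carry
        q, k, v = h_in @ wq[j], h_in @ wk[j], h_in @ wv[j]
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
        out = attn @ v
        outputs.append(out)
        carry = out
    # Concatenate head outputs and project back to the full channel width.
    return np.concatenate(outputs, axis=1) @ wo

# Usage with random weights: 5 tokens, 8 channels, 4 heads.
rng = np.random.default_rng(0)
x = rng.standard_normal((5, 8))
wq = [rng.standard_normal((2, 3)) for _ in range(4)]
wk = [rng.standard_normal((2, 3)) for _ in range(4)]
wv = [rng.standard_normal((2, 2)) for _ in range(4)]
y = cascaded_group_attention(x, wq, wk, wv, np.eye(8))
print(y.shape)  # (5, 8)
```

Because each head projects only its own channel split rather than the full feature, the per-head query, key, and value projections are smaller than in standard MHSA, which is where the computation saving described above would come from in this sketch.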

updated: Thu May 11 2023 17:59:41 GMT+0000 (UTC)

published: Thu May 11 2023 17:59:41 GMT+0000 (UTC)

arXiv
