Q-ViT: Accurate and Fully Quantized Low-bit Vision Transformer

Yanjing Li; Sheng Xu; Baochang Zhang; Xianbin Cao; Peng Gao; Guodong Guo

Large pre-trained vision transformers (ViTs) have demonstrated remarkable performance on various visual tasks, but they incur high computational and memory costs when deployed on resource-constrained devices. Among the powerful compression approaches, quantization substantially reduces computation and memory consumption by using low-bit parameters and bit-wise operations. However, low-bit ViTs remain largely unexplored and usually suffer a significant performance drop compared with their real-valued counterparts. In this work, through extensive empirical analysis, we first identify that the bottleneck behind the severe performance drop is the information distortion of the low-bit quantized self-attention map. We then develop an information rectification module (IRM) and a distribution guided distillation (DGD) scheme for fully quantized vision transformers (Q-ViT) to effectively eliminate such distortion, yielding fully quantized ViTs. We evaluate our methods on the popular DeiT and Swin backbones. Extensive experimental results show that our method achieves much better performance than the prior art. For example, our Q-ViT can theoretically accelerate ViT-S by 6.14x and achieves about 80.9% Top-1 accuracy, even surpassing the full-precision counterpart by 1.0% on the ImageNet dataset. Our code and models are available at https://github.com/YanjingLi0202/Q-ViT
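To make the notion of information distortion in a low-bit attention map more concrete, the following is a minimal sketch of generic uniform quantization applied to one softmax attention row. It is not the paper's IRM or DGD, whose details are not given in the abstract. The function names (`softmax`, `quantize_unsigned`, `entropy`) and the example logits are assumptions chosen purely for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def quantize_unsigned(values, bits):
    """Uniformly quantize values in [0, 1] to 2**bits levels, then dequantize.

    Generic round-to-nearest quantizer used for illustration only;
    it is an assumption, not the quantizer used by Q-ViT.
    """
    levels = 2 ** bits - 1
    return [round(v * levels) / levels for v in values]

def entropy(probs):
    """Shannon entropy in nats after renormalizing; zeros contribute nothing."""
    total = sum(probs)
    if total == 0:
        return 0.0
    return -sum((p / total) * math.log(p / total) for p in probs if p > 0)

# Hypothetical attention logits for a single query token.
scores = [2.0, 1.0, 0.5, 0.0, -1.0, -2.0]
attention = softmax(scores)

print("full precision entropy:", round(entropy(attention), 4))
for bits in (8, 4, 2):
    quantized = quantize_unsigned(attention, bits)
    # Fewer distinct values and lower entropy mean more tokens have
    # collapsed onto the same level, so the row carries less information.
    print(bits, "bits:", len(set(quantized)), "distinct values,",
          "entropy", round(entropy(quantized), 4))
```

In this toy setting, lowering the bit width collapses small attention weights onto the same quantization level, so the quantized row distinguishes fewer tokens than the full-precision row. This is only one simple way to observe the kind of distortion the abstract refers to.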

updated: Thu Oct 13 2022 04:00:29 GMT+0000 (UTC)

published: Thu Oct 13 2022 04:00:29 GMT+0000 (UTC)

arXiv
