I-ViT: Integer-only Quantization for Efficient Vision Transformer Inference

Zhikai Li; Qingyi Gu

Vision Transformers (ViTs) have achieved state-of-the-art performance on various computer vision applications. However, these models have considerable storage and computational overheads, which makes their deployment and efficient inference on edge devices challenging. Quantization is a promising approach to reducing model complexity, and the dyadic arithmetic pipeline allows quantized models to perform efficient integer-only inference. Unfortunately, dyadic arithmetic relies on the homogeneity condition in convolutional neural networks, which does not hold for the non-linear components in ViTs, so integer-only inference of ViTs remains an open issue. In this paper, we propose I-ViT, an integer-only quantization scheme for ViTs that enables the entire computational graph of inference to be performed with integer arithmetic and bit-shifting, without any floating-point arithmetic. In I-ViT, linear operations (e.g., MatMul and Dense) follow the integer-only pipeline with dyadic arithmetic, and non-linear operations (e.g., Softmax, GELU, and LayerNorm) are approximated by the proposed light-weight integer-only arithmetic methods. More specifically, I-ViT applies the proposed Shiftmax and ShiftGELU, which are designed to use integer bit-shifting to approximate the corresponding floating-point operations. We evaluate I-ViT on various benchmark models, and the results show that integer-only INT8 quantization achieves accuracy comparable to (or even slightly higher than) the full-precision (FP) baseline. Furthermore, we utilize TVM for practical hardware deployment on the GPU's integer arithmetic units, achieving a 3.72× to 4.11× inference speedup compared to the FP model. Code for both PyTorch and TVM is released at https://github.com/zkkli/I-ViT.
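To illustrate the dyadic arithmetic the abstract refers to for linear operations, the following is a minimal sketch of generic dyadic requantization: a real-valued rescaling factor is approximated as b / 2^c with integers b and c, so that rescaling an integer accumulator needs only one integer multiply and one right shift. The function names `to_dyadic` and `requantize`, the choice c = 24, and the example scale are assumptions made for illustration; this is not taken from the I-ViT code base and does not cover Shiftmax or ShiftGELU.

```python
from typing import Tuple

import numpy as np


def to_dyadic(scale: float, c: int = 24) -> Tuple[int, int]:
    """Approximate a positive real scale as b / 2**c with integer b (c is assumed fixed)."""
    b = int(round(scale * (1 << c)))
    return b, c


def requantize(acc: np.ndarray, b: int, c: int) -> np.ndarray:
    """Rescale int32 accumulators to int8 using an integer multiply and a bit shift."""
    # Widen to int64 so the product acc * b cannot overflow.
    prod = acc.astype(np.int64) * b
    # Add half of 2**c before the arithmetic right shift to round to nearest.
    rounded = (prod + (1 << (c - 1))) >> c
    # Saturate to the signed 8-bit range.
    return np.clip(rounded, -128, 127).astype(np.int8)


# Hypothetical usage: 0.0123 stands in for the combined scale
# (input scale * weight scale / output scale) of a linear layer.
b, c = to_dyadic(0.0123)
acc = np.array([-5001, -100, 0, 100, 4999, 20000], dtype=np.int32)
print(requantize(acc, b, c))
```

No floating-point value is touched inside `requantize`; the float appears only once, offline, when the scale is converted to the integer pair (b, c).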

updated: Mon Aug 07 2023 03:11:49 GMT+0000 (UTC)

published: Mon Jul 04 2022 13:37:38 GMT+0000 (UTC)

arXiv