MRQ: Support Multiple Quantization Schemes through Model Re-Quantization

Manasa Manohara; Sankalp Dayal; Tarqi Afzal; Rahul Bakshi; Kahkuen Fu

Despite the proliferation of diverse hardware accelerators (e.g., NPU, TPU, DPU), deploying deep learning models on edge devices with fixed-point hardware remains challenging because model quantization and conversion are complex. Existing model quantization frameworks such as TensorFlow QAT [1], TFLite PTQ [2], and Qualcomm AIMET [3] support only a limited set of quantization schemes (e.g., only asymmetric per-tensor quantization in TF1.x QAT [4]). As a result, deep learning models cannot easily be quantized for diverse fixed-point hardware, mainly because each target has slightly different quantization requirements. In this paper, we envision a new type of model quantization approach called MRQ (model re-quantization), which takes existing quantized models and quickly transforms them to meet different quantization requirements (e.g., asymmetric -> symmetric, non-power-of-2 scale -> power-of-2 scale). Re-quantization is much simpler than quantizing from scratch because it avoids costly re-training and supports multiple quantization schemes simultaneously. To minimize re-quantization error, we developed a new set of re-quantization algorithms, including weight correction and rounding error folding. We have demonstrated that a MobileNetV2 QAT model [7] can be quickly re-quantized into two different quantization schemes (i.e., symmetric and symmetric + power-of-2 scale) with less than 0.64 units of accuracy loss. We believe our work is the first to leverage this concept of re-quantization for model quantization. Models obtained from the re-quantization process have been successfully deployed on the NNA in Echo Show devices.
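To make the two scheme conversions named in the abstract concrete (asymmetric -> symmetric, and non-power-of-2 scale -> power-of-2 scale), the following is a minimal sketch of a naive re-quantization baseline. It is not the paper's method: the weight correction and rounding error folding algorithms are not described in the abstract and are not reproduced here. The function names, the 8-bit range of [-127, 127], and the choice to round the scale up to the next power of 2 are all assumptions made for illustration.

```python
import math

QMAX = 127  # assumed symmetric signed 8-bit range: [-127, 127]


def dequantize(q, scale, zero_point):
    """Map asymmetric integer codes back to real values: r = scale * (q - zero_point)."""
    return [scale * (v - zero_point) for v in q]


def requantize_symmetric(q, scale, zero_point, power_of_two=False):
    """Naively re-quantize an asymmetric tensor (non-empty list of ints) to a
    symmetric scheme with zero point 0.

    Hypothetical helper for illustration only; it simply dequantizes and
    quantizes again, without any error-compensation step.
    """
    real = dequantize(q, scale, zero_point)
    max_abs = max(abs(r) for r in real)
    # Choose the smallest scale that still covers the largest magnitude.
    new_scale = max_abs / QMAX if max_abs > 0 else 1.0
    if power_of_two:
        # Round the scale UP to a power of 2 so no value falls outside the range.
        new_scale = 2.0 ** math.ceil(math.log2(new_scale))
    # Round to the nearest integer code and clamp to the symmetric range.
    q_new = [max(-QMAX, min(QMAX, round(r / new_scale))) for r in real]
    return q_new, new_scale


# Usage: an asymmetric uint8 tensor with scale 0.02 and zero point 128.
codes = [0, 64, 128, 192, 255]
q_sym, s_sym = requantize_symmetric(codes, 0.02, 128)
q_p2, s_p2 = requantize_symmetric(codes, 0.02, 128, power_of_two=True)
```

In this sketch each re-quantized value differs from the dequantized original by at most half of the new scale. Rounding the scale up to a power of 2 coarsens the grid, which is one source of the re-quantization error the abstract says the proposed algorithms aim to minimize.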

updated: Tue Aug 01 2023 08:15:30 GMT+0000 (UTC)

published: Tue Aug 01 2023 08:15:30 GMT+0000 (UTC)

arXiv
