AutoQNN: An End-to-End Framework for Automatically Quantizing Neural Networks

Cheng Gong; Ye Lu; Surong Dai; Deng Qian; Chenkun Du; Tao Li

Finding a suitable quantizing scheme together with an appropriate mixed-precision policy is key to compressing deep neural networks (DNNs) efficiently and accurately. This exploration places a heavy workload on domain experts, so an automatic compression method is needed. However, the huge search space of automatic methods demands a large computing budget, which makes the automatic process hard to apply in real scenarios. In this paper, we propose AutoQNN, an end-to-end framework that automatically quantizes different layers with different schemes and bitwidths, without any human labor. AutoQNN efficiently seeks desirable quantizing schemes and mixed-precision policies for mainstream DNN models by combining three techniques: quantizing scheme search (QSS), quantizing precision learning (QPL), and quantized architecture generation (QAG). QSS introduces five quantizing schemes and defines three new ones, forming a candidate set for scheme search, and then uses the differentiable neural architecture search (DNAS) algorithm to seek the scheme best suited to each layer or to the whole model. To the best of our knowledge, QPL is the first method to learn mixed-precision policies by reparameterizing the bitwidths of quantizing schemes. QPL efficiently optimizes both the classification loss and the precision loss of DNNs, and obtains a relatively optimal mixed-precision model within a limited model size and memory footprint. QAG converts arbitrary architectures into their quantized counterparts without manual intervention, which facilitates end-to-end neural network quantization. We have implemented AutoQNN and integrated it into Keras. Extensive experiments demonstrate that AutoQNN consistently outperforms state-of-the-art quantization methods.
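The idea behind a DNAS-style scheme search can be illustrated with a minimal sketch in plain Python. This is not the AutoQNN implementation: the three candidate quantizers, the names `CANDIDATES`, `mixed_quantize`, and `select_scheme`, and the fixed logits are all assumptions made for illustration, and they stand in for the candidate set described in the abstract. During search, the outputs of all candidates are blended with softmax weights over per-layer architecture logits; afterwards, the candidate with the largest logit is kept.

```python
import math

def uniform_quantize(weights, bits):
    """Symmetric uniform quantizer (illustrative): 2**(bits-1) - 1 levels per sign, plus zero."""
    scale = max(abs(w) for w in weights) or 1.0   # avoid dividing by zero
    levels = 2 ** (bits - 1) - 1                  # e.g. 127 for 8 bits
    return [round(w / scale * levels) / levels * scale for w in weights]

def binary_quantize(weights):
    """Binary quantizer (illustrative): keep the sign, scale by the mean magnitude."""
    alpha = sum(abs(w) for w in weights) / len(weights)
    return [alpha if w >= 0 else -alpha for w in weights]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical candidate set; the paper's own set contains eight schemes.
CANDIDATES = {
    "binary": binary_quantize,
    "uniform_4bit": lambda w: uniform_quantize(w, 4),
    "uniform_8bit": lambda w: uniform_quantize(w, 8),
}

def mixed_quantize(weights, arch_logits):
    """Search phase: softmax-weighted blend of every candidate's output."""
    probs = softmax(arch_logits)
    outputs = [quantize(weights) for quantize in CANDIDATES.values()]
    return [
        sum(p * out[i] for p, out in zip(probs, outputs))
        for i in range(len(weights))
    ]

def select_scheme(arch_logits):
    """After search: keep the candidate with the largest architecture logit."""
    names = list(CANDIDATES)
    best = max(range(len(names)), key=lambda i: arch_logits[i])
    return names[best]

if __name__ == "__main__":
    layer_weights = [0.9, -0.4, 0.05, -0.7]
    logits = [0.1, 2.0, -1.0]  # would be learned by gradient descent in a real search
    print(mixed_quantize(layer_weights, logits))
    print(select_scheme(logits))
```

In an actual differentiable search the logits would be trained jointly with the network weights rather than fixed by hand, and one set of logits would typically be kept per layer so that different layers can settle on different schemes.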

updated: Fri Apr 07 2023 11:14:21 GMT+0000 (UTC)

published: Fri Apr 07 2023 11:14:21 GMT+0000 (UTC)

arXiv
