OPQ: Compressing Deep Neural Networks with One-shot Pruning-Quantization

Peng Hu; Xi Peng; Hongyuan Zhu; Mohamed M. Sabry Aly; Jie Lin

Deep Neural Networks (DNNs) are usually overparameterized, with millions of weight parameters, which makes it challenging to deploy these large models on resource-constrained hardware platforms such as smartphones. Numerous network compression methods, such as pruning and quantization, have been proposed to reduce model size significantly; the key to these methods is finding a suitable compression allocation (e.g., pruning sparsity and quantization codebook) for each layer. Existing solutions obtain the compression allocation in an iterative or manual fashion while finetuning the compressed model, and therefore suffer from poor efficiency. In contrast to prior art, we propose a novel One-shot Pruning-Quantization (OPQ) method, which analytically solves the compression allocation using only the pre-trained weight parameters. During finetuning, the compression module is fixed and only the weight parameters are updated. To our knowledge, OPQ is the first work to reveal that a pre-trained model is sufficient for solving pruning and quantization simultaneously, without any complex iterative or manual optimization at the finetuning stage. Furthermore, we propose a unified channel-wise quantization method that forces all channels of each layer to share a common codebook, which leads to a low bit-rate allocation without the extra overhead of traditional channel-wise quantization. Comprehensive experiments on ImageNet with AlexNet, MobileNet-V1, and ResNet-50 show that our method improves accuracy and training efficiency while obtaining significantly higher compression rates than the state of the art.
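
The abstract does not spell out how the compression allocation is computed, so the following is only an illustrative sketch of the two ingredients it names: pruning a layer to a given sparsity, then quantizing the surviving weights so that every channel indexes one shared codebook. The function names, the magnitude-based pruning rule, the per-channel rescaling, and the uniform codebook are all assumptions made for this example. They are not taken from the paper, and the sparsity and bit-width are passed in by hand rather than solved analytically.

```python
import numpy as np


def prune_by_magnitude(w, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of the weights.

    Hypothetical helper: plain magnitude pruning, used here only as a
    stand-in for whatever pruning rule a real method would apply.
    """
    k = int(round(sparsity * w.size))
    if k == 0:
        return w.copy(), np.ones(w.shape, dtype=bool)
    # Threshold is the k-th smallest absolute value in the layer.
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    mask = np.abs(w) > thresh
    return w * mask, mask


def quantize_shared_codebook(w, mask, bits):
    """Quantize a (channels, n) weight matrix with one codebook per layer.

    Assumption for this sketch: each channel is divided by its own maximum
    magnitude, so all channels can index the same uniform codebook on
    [-1, 1]. Pruned positions are kept at exactly zero through `mask`.
    """
    codebook = np.linspace(-1.0, 1.0, 2 ** bits)
    scale = np.abs(w).max(axis=1, keepdims=True)
    scale[scale == 0] = 1.0  # avoid dividing by zero for all-zero channels
    normalized = w / scale
    # Index of the nearest codebook entry for every weight.
    idx = np.abs(normalized[..., None] - codebook).argmin(axis=-1)
    w_q = codebook[idx] * scale * mask
    return w_q, idx, codebook, scale
```

In this sketch the only per-channel state is a single scale factor, while the codebook itself is stored once per layer. That is one simple way to read the idea of sharing a common codebook across channels; the paper's actual formulation may differ.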

updated: Mon May 23 2022 09:05:25 GMT+0000 (UTC)

published: Mon May 23 2022 09:05:25 GMT+0000 (UTC)

arXiv
