Neural Network Compression using Binarization and Few Full-Precision Weights

Franco Maria Nardini; Cosimo Rulli; Salvatore Trani; Rossano Venturini

Quantization and pruning are two effective methods for compressing deep neural network models. In this paper, we propose Automatic Prune Binarization (APB), a novel compression technique that combines quantization with pruning. APB enhances the representational capability of binary networks using a few full-precision weights. Our technique jointly maximizes the accuracy of the network and minimizes its memory footprint by deciding whether each weight should be binarized or kept in full precision. We show how to efficiently perform a forward pass through a layer compressed with APB by decomposing it into a binary matrix multiplication and a sparse-dense matrix multiplication. Moreover, we design two novel efficient algorithms for extremely quantized matrix multiplication on CPU, leveraging highly efficient bitwise operations. The proposed algorithms are 6.9x and 1.5x faster than available state-of-the-art solutions. We perform an extensive evaluation of APB on two widely adopted model compression datasets, namely CIFAR10 and ImageNet. APB delivers a better accuracy/memory trade-off than state-of-the-art methods based on i) quantization, ii) pruning, and iii) the combination of pruning and quantization. APB also outperforms quantization in the accuracy/efficiency trade-off, being up to 2x faster than the 2-bit quantized model with no loss in accuracy.
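
The decomposition of the forward pass into a binary product plus a sparse-dense product can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The abstract does not give the rule for choosing which weights stay in full precision, so the magnitude threshold used here, the choice of the scale `alpha` as the mean magnitude of the binarized weights, and the helper names `apb_split`, `apb_forward`, `pack_signs`, and `binary_dot_popcount` are all assumptions made for illustration. The last two helpers show the standard XOR/popcount identity for dot products of {-1, +1} vectors, which is the kind of bitwise operation that makes binary multiplication cheap on CPU.

```python
import numpy as np


def apb_split(W, threshold):
    """Split W into a dense sign matrix B, a sparse correction S and a scale alpha.

    Assumption (not from the source): weights with |w| > threshold stay in
    full precision, all others are binarized to +/- alpha.
    """
    keep = np.abs(W) > threshold                 # full-precision positions
    B = np.where(W >= 0, 1.0, -1.0)              # dense {-1, +1} matrix
    # Hypothetical scale: mean magnitude of the binarized weights.
    alpha = float(np.abs(W[~keep]).mean()) if (~keep).any() else 0.0
    # S is zero except at kept positions, where it stores the residual
    # w - alpha * sign(w), so that alpha * B + S restores w exactly there.
    S = np.where(keep, W - alpha * B, 0.0)
    return B, S, alpha, keep


def apb_forward(x, B, S, alpha):
    """Forward pass as a binary product plus a sparse-dense product."""
    return alpha * (x @ B) + x @ S


def pack_signs(v):
    """Pack a {-1, +1} vector into an int, using bit 1 for +1 and bit 0 for -1."""
    bits = 0
    for i, s in enumerate(v):
        if s > 0:
            bits |= 1 << i
    return bits


def binary_dot_popcount(a_bits, b_bits, n):
    """Dot product of two packed {-1, +1} vectors of length n.

    Positions that differ contribute -1 and positions that agree contribute +1,
    so the result is n - 2 * popcount(a XOR b).
    """
    return n - 2 * bin(a_bits ^ b_bits).count("1")
```

In this sketch `S` would be stored in a sparse format in practice, since only the few full-precision positions are nonzero; a dense array is used here only to keep the example short.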

updated: Thu Jun 15 2023 08:52:00 GMT+0000 (UTC)

published: Thu Jun 15 2023 08:52:00 GMT+0000 (UTC)

arXiv
