Mixed-Precision Neural Network Quantization via Learned Layer-wise Importance

Chen Tang; Kai Ouyang; Zhi Wang; Yifei Zhu; Yaowei Wang; Wen Ji; Wenwu Zhu

The exponentially large discrete search space in mixed-precision quantization (MPQ) makes it hard to determine the optimal bit-width for each layer. Previous works usually resort to iterative search methods on the training set, which consume hundreds or even thousands of GPU-hours. In this study, we reveal that some unique learnable parameters in quantization, namely the scale factors in the quantizer, can serve as importance indicators of a layer, reflecting the contribution of that layer to the final accuracy at certain bit-widths. These importance indicators naturally perceive the numerical transformation during quantization-aware training, and so can accurately provide quantization sensitivity metrics for layers. However, a deep network typically contains hundreds of such indicators, and training them one by one would incur an excessive time cost. To overcome this issue, we propose a joint training scheme that can obtain all indicators at once. It considerably speeds up indicator training by parallelizing the original sequential training processes. With these learned importance indicators, we formulate the MPQ search problem as a one-time integer linear programming (ILP) problem. This avoids the iterative search and significantly reduces search time without limiting the bit-width search space. For example, MPQ search on ResNet18 with our indicators takes only 0.06 seconds. Extensive experiments also show that our approach can achieve SOTA accuracy on ImageNet for a wide range of models under various constraints (e.g., BitOps, compression rate).
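
To make the one-time search concrete, the sketch below sets up a toy instance of the bit-width selection problem: choose one bit-width per layer so that the summed importance is minimal while the total weight size stays within a budget. It is an illustration under stated assumptions, not the authors' implementation. The indicator values, layer sizes, and the name `search_bit_widths` are all hypothetical; the objective (minimize the sum of indicator values, with a model-size constraint standing in for BitOps or compression rate) is an assumed form; and the problem is solved by exhaustive enumeration, which is only feasible for a handful of layers, rather than by an ILP solver.

```python
from itertools import product

def search_bit_widths(importance, layer_sizes, bit_choices, budget_bits):
    """Toy one-shot bit-width search (hypothetical helper, not from the paper).

    importance[l][b]: assumed indicator value for layer l at bit-width b;
                      a larger value is taken to mean a larger accuracy cost.
    layer_sizes[l]:   number of weights in layer l.
    budget_bits:      upper bound on the total weight storage, in bits.
    """
    best_cost, best_assignment = float("inf"), None
    # Enumerate every per-layer assignment; an ILP solver would replace this loop.
    for assignment in product(bit_choices, repeat=len(layer_sizes)):
        size = sum(n * b for n, b in zip(layer_sizes, assignment))
        if size > budget_bits:
            continue  # violates the size constraint
        cost = sum(importance[l][b] for l, b in enumerate(assignment))
        if cost < best_cost:
            best_cost, best_assignment = cost, assignment
    return best_assignment, best_cost

# Made-up numbers for a three-layer network (illustration only).
layer_sizes = [100, 400, 200]
importance = [
    {2: 0.90, 4: 0.30, 8: 0.10},  # small but sensitive layer
    {2: 0.20, 4: 0.10, 8: 0.05},  # large but insensitive layer
    {2: 0.60, 4: 0.25, 8: 0.10},
]

# Budget equal to a uniform 4-bit model: 700 weights * 4 bits = 2800 bits.
assignment, cost = search_bit_widths(importance, layer_sizes, (2, 4, 8), 2800)
print(assignment, round(cost, 2))
```

With these made-up values the search picks 8, 2, and 4 bits for the three layers, which has a lower summed importance (0.55) than the uniform 4-bit assignment (0.65) at the same budget: bits are moved from the large, insensitive layer to the small, sensitive one.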

updated: Mon May 09 2022 07:56:28 GMT+0000 (UTC)

published: Wed Mar 16 2022 03:23:50 GMT+0000 (UTC)

arXiv
