FxP-QNet: A Post-Training Quantizer for the Design of Mixed Low-Precision DNNs with Dynamic Fixed-Point Representation

Ahmad Shawahna; Sadiq M. Sait; Aiman El-Maleh; Irfan Ahmad

Deep neural networks (DNNs) have demonstrated their effectiveness in a wide range of computer vision tasks, with state-of-the-art results obtained through complex and deep structures that require intensive computation and memory. Efficient model inference is now crucial for consumer applications on resource-constrained platforms. As a result, there is much interest in the research and development of dedicated deep learning (DL) hardware to improve the throughput and energy efficiency of DNNs. Low-precision representation of DNN data structures through quantization would bring great benefits to specialized DL hardware. However, aggressive quantization leads to a severe accuracy drop. Quantization therefore opens a large hyper-parameter space at the bit-precision level, the exploration of which is a major challenge. In this paper, we propose a novel framework, referred to as the Fixed-Point Quantizer of deep neural Networks (FxP-QNet), that flexibly designs a mixed low-precision DNN for integer-arithmetic-only deployment. Specifically, FxP-QNet gradually adapts the quantization level for each data structure of each layer based on the trade-off between network accuracy and the low-precision requirements. Additionally, it employs post-training self-distillation and network prediction error statistics to optimize the quantization of floating-point values into fixed-point numbers. Examining FxP-QNet on state-of-the-art architectures and the benchmark ImageNet dataset, we empirically demonstrate its effectiveness in achieving the accuracy-compression trade-off without the need for training. The results show that FxP-QNet-quantized AlexNet, VGG-16, and ResNet-18 reduce the overall memory requirements of their full-precision counterparts by 7.16x, 10.36x, and 6.44x with less than 0.95%, 0.95%, and 1.99% accuracy drop, respectively.
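
As an illustration of what quantizing floating-point values into dynamic fixed-point numbers can look like, here is a minimal Python sketch. It is not the FxP-QNet algorithm: the abstract says the framework tunes the quantization with post-training self-distillation and prediction error statistics, whereas this sketch simply assumes the fractional length is chosen from the largest magnitude in a tensor. The function names (`choose_frac_bits`, `quantize`, `dequantize`) and the max-magnitude rule are assumptions made for the example.

```python
import math


def choose_frac_bits(values, word_bits):
    """Pick a fractional length so the largest magnitude fits in a signed word.

    Assumption for this sketch: a simple max-magnitude rule, not the
    optimization procedure described in the paper.
    """
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return word_bits - 1
    # Bits needed for the integer part, excluding the sign bit.
    int_bits = math.floor(math.log2(max_abs)) + 1
    return word_bits - 1 - int_bits


def quantize(values, word_bits, frac_bits):
    """Map floats to signed integer codes with a shared power-of-two scale."""
    scale = 2.0 ** frac_bits
    qmin = -(1 << (word_bits - 1))
    qmax = (1 << (word_bits - 1)) - 1
    # Round to the nearest code, then saturate to the representable range.
    return [min(qmax, max(qmin, round(v * scale))) for v in values]


def dequantize(codes, frac_bits):
    """Recover the real values that the integer codes represent."""
    scale = 2.0 ** frac_bits
    return [c / scale for c in codes]


# Hypothetical per-layer settings: each layer gets its own word length,
# which is what "mixed low-precision" refers to in the abstract.
layers = {
    "conv1": ([0.5, -0.25, 0.75, 0.1], 8),
    "fc1": ([3.2, -1.5, 0.05, 2.0], 6),
}
for name, (weights, word_bits) in layers.items():
    frac_bits = choose_frac_bits(weights, word_bits)
    codes = quantize(weights, word_bits, frac_bits)
    print(name, word_bits, frac_bits, codes, dequantize(codes, frac_bits))
```

Because the scale is a power of two and is shared within a tensor, the stored codes are plain integers and rescaling reduces to bit shifts, which is what makes integer-arithmetic-only deployment possible. The "dynamic" part is that the fractional length may differ from one layer or data structure to the next.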

updated: Tue Mar 22 2022 23:01:43 GMT+0000 (UTC)

published: Tue Mar 22 2022 23:01:43 GMT+0000 (UTC)

arXiv
