Mitigating Memory Wall Effects in CNN Engines with On-the-Fly Weights Generation

Stylianos I. Venieris; Javier Fernandez-Marques; Nicholas D. Lane

The unprecedented accuracy of convolutional neural networks (CNNs) across a broad range of AI tasks has led to their widespread deployment in mobile and embedded settings. In pursuit of high-performance and energy-efficient inference, significant research effort has been invested in the design of FPGA-based CNN accelerators. In this context, single computation engines constitute a popular approach to support diverse CNN modes without the overhead of fabric reconfiguration. Nevertheless, this flexibility often comes with significantly degraded performance on memory-bound layers and with resource underutilisation due to the suboptimal mapping of certain layers on the engine's fixed configuration. In this work, we investigate the implications for CNN engine design of a class of models that introduce a pre-convolution stage to decompress the weights at run time. We refer to these approaches as on-the-fly. This paper presents unzipFPGA, a novel CNN inference system that counteracts the limitations of existing CNN engines. The proposed framework comprises a novel CNN hardware architecture with a weights generator module that enables the on-chip, on-the-fly generation of weights, alleviating the negative impact of limited bandwidth on memory-bound layers. We further enhance unzipFPGA with an automated hardware-aware methodology that tailors the weights generation mechanism to the target CNN-device pair, leading to an improved accuracy-performance balance. Finally, we introduce an input-selective processing element (PE) design that balances the load between PEs in suboptimally mapped layers. The proposed framework yields hardware designs that achieve an average of 2.57x performance efficiency gain over highly optimised GPU designs under the same power constraints and up to 3.94x higher performance density over a diverse range of state-of-the-art FPGA-based CNN accelerators.
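To illustrate why generating weights on chip can help a memory-bound layer, the following is a minimal roofline-style sketch. It is not taken from the paper: the function names, the `weight_fraction` parameter (the share of weight bytes still fetched from off-chip memory), and all numbers are hypothetical, chosen only to show the effect of reduced off-chip traffic on the attainable throughput bound.

```python
def attainable_gops(ops, bytes_moved, peak_gops, bandwidth_gbs):
    """Roofline bound: min(peak compute, operational intensity * bandwidth)."""
    intensity = ops / bytes_moved  # operations per byte of off-chip traffic
    return min(peak_gops, intensity * bandwidth_gbs)


def layer_bound(ops, act_bytes, weight_bytes, peak_gops, bandwidth_gbs,
                weight_fraction=1.0):
    """Bound for one layer when only `weight_fraction` of the weight bytes
    is transferred from off-chip memory (hypothetical parameter; the rest
    is assumed to be generated on chip at run time)."""
    traffic = act_bytes + weight_fraction * weight_bytes
    return attainable_gops(ops, traffic, peak_gops, bandwidth_gbs)


# Hypothetical layer and platform figures, for illustration only.
OPS = 2e8            # operations in the layer
ACT_BYTES = 1e6      # activation traffic in bytes
WEIGHT_BYTES = 9e6   # full weight traffic in bytes
PEAK_GOPS = 500.0    # engine peak throughput in GOp/s
BANDWIDTH_GBS = 4.0  # off-chip bandwidth in GB/s

baseline = layer_bound(OPS, ACT_BYTES, WEIGHT_BYTES, PEAK_GOPS, BANDWIDTH_GBS)
on_the_fly = layer_bound(OPS, ACT_BYTES, WEIGHT_BYTES, PEAK_GOPS,
                         BANDWIDTH_GBS, weight_fraction=0.25)

print(f"baseline bound:   {baseline:.1f} GOp/s")
print(f"on-the-fly bound: {on_the_fly:.1f} GOp/s")
```

Under these assumed figures both bounds stay below the compute peak, so the layer remains bandwidth-limited, but the bound rises as less weight data crosses the off-chip interface. The sketch ignores the cost of the generation step itself, which a real design would have to account for.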

updated: Tue Jul 25 2023 11:19:21 GMT+0000 (UTC)

published: Tue Jul 25 2023 11:19:21 GMT+0000 (UTC)

arXiv

