unzipFPGA: Enhancing FPGA-based CNN Engines with On-the-Fly Weights Generation

Stylianos I. Venieris; Javier Fernandez-Marques; Nicholas D. Lane

Single computation engines have become a popular design choice for FPGA-based convolutional neural networks (CNNs), enabling the deployment of diverse models without fabric reconfiguration. This flexibility, however, often comes with significantly reduced performance on memory-bound layers and with resource underutilisation due to the suboptimal mapping of certain layers onto the engine's fixed configuration. In this work, we investigate the implications for CNN engine design of a class of models that introduce a pre-convolution stage to decompress the weights at run time. We refer to these approaches as on-the-fly. To minimise the negative impact of limited bandwidth on memory-bound layers, we present a novel hardware component that enables the on-chip on-the-fly generation of weights. We further introduce an input selective processing element (PE) design that balances the load between PEs on suboptimally mapped layers. Finally, we present unzipFPGA, a framework to train on-the-fly models and traverse the design space to select the highest performing CNN engine configuration. Quantitative evaluation shows that unzipFPGA yields average speedups of 2.14x and 71% over optimised status-quo and pruned CNN engines, respectively, under constrained bandwidth, and up to 3.69x higher performance density than state-of-the-art FPGA-based CNN accelerators.
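
The abstract does not say how the weights are represented or generated, so the following is only a rough sketch of the general idea behind on-the-fly generation. It assumes, purely for illustration, that each filter is stored off-chip as a few coefficients and rebuilt on-chip as a linear combination of fixed basis vectors. The function names, the random +1/-1 basis, and the sizes are all hypothetical and are not taken from the paper.

```python
import random


def make_basis(num_basis, filter_size, seed=0):
    """Build fixed basis vectors (hypothetical; assumed to be held on-chip)."""
    rng = random.Random(seed)
    return [[rng.choice((-1.0, 1.0)) for _ in range(filter_size)]
            for _ in range(num_basis)]


def generate_filter(coeffs, basis):
    """Rebuild one filter as a linear combination of the basis vectors."""
    filter_size = len(basis[0])
    return [sum(c * vec[i] for c, vec in zip(coeffs, basis))
            for i in range(filter_size)]


def offchip_words(num_filters, filter_size, num_basis=None):
    """Count the weight words fetched from off-chip memory for one layer.

    With num_basis=None the full filters are transferred; otherwise only
    num_basis coefficients per filter are transferred (assumed scheme).
    """
    per_filter = filter_size if num_basis is None else num_basis
    return num_filters * per_filter


# Illustrative sizes: 64 filters of 3x3 = 9 weights, 4 coefficients each.
basis = make_basis(num_basis=4, filter_size=9)
weights = generate_filter([0.5, -0.25, 0.0, 1.0], basis)
full_traffic = offchip_words(64, 9)
compact_traffic = offchip_words(64, 9, num_basis=4)
```

Under these assumed sizes, the layer fetches 256 words instead of 576, which is the kind of traffic reduction that would relieve a memory-bound layer. The price is extra on-chip computation to rebuild each filter before the convolution.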

updated: Sat Apr 03 2021 14:15:01 GMT+0000 (UTC)

published: Tue Mar 09 2021 18:19:41 GMT+0000 (UTC)

arXiv
