Spartus: A 9.4 TOp/s FPGA-based LSTM Accelerator Exploiting Spatio-temporal Sparsity

Chang Gao; Tobi Delbruck; Shih-Chii Liu


Long Short-Term Memory (LSTM) recurrent networks are frequently used for tasks involving time-sequential data, such as speech recognition. However, it is difficult to deploy these networks on hardware with high throughput and low latency, because their fully-connected structure makes LSTM inference memory-bound. Previous work on LSTM accelerators exploited either spatial sparsity in the weights or temporal sparsity. In this paper, we present a new accelerator called "Spartus" that exploits spatio-temporal sparsity to achieve ultra-low latency inference. The spatial sparsity is induced using our proposed pruning method, Column-Balanced Targeted Dropout (CBTD), which produces structured sparse weight matrices that benefit workload balance. CBTD achieved up to 96% weight sparsity with negligible accuracy difference for an LSTM network trained on a TIMIT phone recognition task. To induce temporal sparsity in LSTM, we create the DeltaLSTM by extending the previous DeltaGRU method to the LSTM network. The combined sparsity reduces weight memory accesses and the associated arithmetic operations simultaneously. Spartus was implemented on a Xilinx Zynq-7100 FPGA. The per-sample latency for a single DeltaLSTM layer of 1024 neurons running on Spartus is 1 µs. Spartus achieved 9.4 TOp/s effective batch-1 throughput and 1.1 TOp/J energy efficiency, which are respectively 4X and 7X higher than the previous state-of-the-art.
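The two kinds of sparsity described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation. The function names (`column_balanced_prune`, `delta_encode`, `delta_matvec`) and the parameters `keep_per_column` and `threshold` are hypothetical. The pruning step is a simplified, deterministic stand-in that keeps the same number of largest-magnitude weights in every column. It omits the dropout-during-training aspect of CBTD, which the abstract does not detail. The delta step shows the general delta-network idea of propagating only input changes larger than a threshold, applied to a single matrix-vector product rather than a full LSTM cell.

```python
import numpy as np

def column_balanced_prune(W, keep_per_column):
    """Zero all but the `keep_per_column` largest-magnitude entries of each column.

    Assumption: a simplified stand-in for column-balanced pruning, so that
    every column carries the same number of nonzero weights.
    """
    W = np.asarray(W, dtype=float)
    # Row indices of the largest-magnitude entries, computed per column.
    top_rows = np.argsort(-np.abs(W), axis=0)[:keep_per_column]
    mask = np.zeros(W.shape, dtype=bool)
    np.put_along_axis(mask, top_rows, True, axis=0)
    return W * mask

def delta_encode(x, x_ref, threshold):
    """Return (delta, new_ref); changes with magnitude <= threshold are dropped."""
    diff = x - x_ref
    fired = np.abs(diff) > threshold
    delta = np.where(fired, diff, 0.0)
    # The reference is updated only where a change was actually propagated.
    new_ref = np.where(fired, x, x_ref)
    return delta, new_ref

def delta_matvec(W, xs, threshold):
    """Track W @ x_t over a sequence, touching only columns with nonzero delta."""
    x_ref = np.zeros(W.shape[1])
    acc = np.zeros(W.shape[0])
    outputs = []
    for x in xs:
        delta, x_ref = delta_encode(x, x_ref, threshold)
        active = np.flatnonzero(delta)  # columns whose weights must be fetched
        acc = acc + W[:, active] @ delta[active]
        outputs.append(acc.copy())
    return np.stack(outputs)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = column_balanced_prune(rng.standard_normal((8, 6)), keep_per_column=2)
    xs = rng.standard_normal((5, 6))
    approx = delta_matvec(W, xs, threshold=0.1)
    print(np.abs(approx - xs @ W.T).max())  # approximation error from thresholding
```

With `threshold=0.0` the accumulated result equals the dense product `W @ x_t` at every step. A larger threshold skips more columns, and therefore more weight fetches, at the cost of approximation error. Balancing the nonzeros per column means each skipped or fetched column costs the same amount of work, which is the workload-balance property the abstract attributes to CBTD.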

updated: Wed Aug 04 2021 22:02:14 GMT+0000 (UTC)

published: Wed Aug 04 2021 22:02:14 GMT+0000 (UTC)

arXiv
