Spartus: A 9.4 TOp/s FPGA-based LSTM Accelerator Exploiting Spatio-temporal Sparsity

Chang Gao; Tobi Delbruck; Shih-Chii Liu

Long Short-Term Memory (LSTM) recurrent networks are frequently used for tasks involving time-sequential data, such as speech recognition. However, it is difficult to deploy these networks on hardware with high throughput and low latency, because their fully connected structure makes LSTM inference memory-bound. Previous LSTM accelerators exploited either spatial weight sparsity or temporal activation sparsity, but not both. This paper proposes a new accelerator called "Spartus" that exploits spatio-temporal sparsity to achieve ultra-low latency inference. The spatial sparsity is induced using our proposed pruning method called Column-Balanced Targeted Dropout (CBTD), which structures sparse weight matrices for a balanced workload. CBTD achieved up to 96% weight sparsity with negligible accuracy difference for an LSTM network trained on a TIMIT phone recognition task. To induce temporal sparsity in LSTM, we create the DeltaLSTM by extending the previous DeltaGRU method to the LSTM network. The combined sparsity reduces both weight memory accesses and the associated arithmetic operations. Spartus was implemented on a Xilinx Zynq-7100 FPGA. The Spartus per-sample latency for a single DeltaLSTM layer of 1024 neurons averages 1 µs. Spartus achieved 9.4 TOp/s effective batch-1 throughput and 1.1 TOp/J energy efficiency, which are, respectively, 4× and 7× higher than the previous state of the art.
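The two kinds of sparsity named in the abstract can be illustrated with a small software sketch. This is not the Spartus hardware design or the CBTD training procedure: the abstract gives no algorithmic details, so everything below is an assumption made for illustration. `prune_column_balanced` is a deterministic magnitude-pruning stand-in that only mimics the balance property (every sub-column keeps the same number of nonzeros), and `DeltaMatVec` shows the general delta idea of propagating an input only when it has changed by more than a threshold. The names, the row-interleaved grouping, and the threshold value are all hypothetical.

```python
def prune_column_balanced(W, num_groups, keep_per_group):
    """Keep the `keep_per_group` largest-magnitude weights in each sub-column.

    Assumption: row r belongs to group r % num_groups, so each column is
    split into `num_groups` interleaved sub-columns with equal nonzero counts.
    """
    rows, cols = len(W), len(W[0])
    pruned = [[0.0] * cols for _ in range(rows)]
    for c in range(cols):
        for g in range(num_groups):
            idx = list(range(g, rows, num_groups))
            idx.sort(key=lambda r: abs(W[r][c]), reverse=True)
            for r in idx[:keep_per_group]:
                pruned[r][c] = W[r][c]
    return pruned


class DeltaMatVec:
    """Incrementally tracks W @ x_ref, where x_ref holds the last propagated inputs."""

    def __init__(self, W, threshold):
        self.W = W
        self.threshold = threshold
        self.x_ref = [0.0] * len(W[0])  # last propagated value of each input
        self.acc = [0.0] * len(W)       # running matrix-vector product
        self.macs = 0                   # multiply-accumulates actually performed

    def step(self, x):
        for c, xc in enumerate(x):
            delta = xc - self.x_ref[c]
            if abs(delta) <= self.threshold:
                continue  # temporal sparsity: the whole column is skipped
            self.x_ref[c] = xc
            for r, row in enumerate(self.W):
                w = row[c]
                if w != 0.0:  # spatial sparsity: pruned weights cost nothing
                    self.acc[r] += w * delta
                    self.macs += 1
        return list(self.acc)


W = [[1.0, -5.0], [2.0, 6.0], [-3.0, 0.5], [0.1, -7.0]]
W_sparse = prune_column_balanced(W, num_groups=2, keep_per_group=1)
layer = DeltaMatVec(W_sparse, threshold=0.1)
y1 = layer.step([1.0, 2.0])   # both inputs change, both columns are processed
y2 = layer.step([1.05, 2.0])  # changes are below the threshold, nothing is computed
y3 = layer.step([1.05, 3.0])  # only the second column is processed
```

In this toy run a dense update would cost 8 multiply-accumulates per step (24 in total), while the sketch performs 6 across all three steps. The saving comes from skipping pruned weights and from skipping columns whose input barely changed, which is the combination the abstract refers to as spatio-temporal sparsity.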

updated: Fri Aug 20 2021 14:29:37 GMT+0000 (UTC)

published: Wed Aug 04 2021 22:02:14 GMT+0000 (UTC)

arXiv
