Algorithm and Hardware Co-Design of Energy-Efficient LSTM Networks for Video Recognition with Hierarchical Tucker Tensor Decomposition

Yu Gong; Miao Yin; Lingyi Huang; Chunhua Deng; Yang Sui; Bo Yuan

Long short-term memory (LSTM) is a powerful type of deep neural network that has been widely used in many sequence analysis and modeling applications. However, the large model size of LSTM networks makes their practical deployment very challenging, especially for video recognition tasks that require high-dimensional input data. To overcome this limitation and fully unlock the potential of LSTM models, in this paper we propose algorithm and hardware co-design towards high-performance, energy-efficient LSTM networks. At the algorithm level, we develop a fully decomposed hierarchical Tucker (FDHT) structure-based LSTM, namely FDHT-LSTM, which enjoys ultra-low model complexity while still achieving high accuracy. To fully reap this algorithmic benefit, we further develop a corresponding customized hardware architecture to support efficient execution of the proposed FDHT-LSTM model. With a carefully designed memory access scheme, the complicated matrix transformation is supported efficiently by the underlying hardware, on the fly and without any access conflict. Our evaluation results show that both the proposed ultra-compact FDHT-LSTM models and the corresponding hardware accelerator achieve very high performance. Compared with state-of-the-art compressed LSTM models, FDHT-LSTM achieves both an order-of-magnitude reduction in model size and a significant accuracy improvement across different video recognition datasets. Meanwhile, compared with TIE, the state-of-the-art hardware for tensor-decomposed models, our proposed FDHT-LSTM architecture achieves better throughput, area efficiency and energy efficiency on the LSTM-Youtube workload. For the LSTM-UCF workload, our proposed design also outperforms TIE, with higher throughput, higher energy efficiency and comparable area efficiency.

updated: Mon Dec 05 2022 05:51:56 GMT+0000 (UTC)

published: Mon Dec 05 2022 05:51:56 GMT+0000 (UTC)

arXiv
