Environmental Sound Classification on the Edge: Deep Acoustic Networks for Extremely Resource-Constrained Devices

Md Mohaimenuzzaman; Christoph Bergmeir; Ian Thomas West; Bernd Meyer

Significant effort is being invested in bringing the classification and recognition capabilities of desktop and cloud systems directly to edge devices. The main challenge for deep learning on the edge is handling extreme resource constraints (memory, CPU speed, and lack of GPU support). We present an edge solution for audio classification that achieves close to state-of-the-art performance on ESC-50, the same benchmark used to assess large, non-resource-constrained networks. Importantly, we do not specifically engineer the network for edge devices. Rather, we present a universal pipeline that automatically converts a large deep convolutional neural network (CNN), via compression and quantization, into a network suitable for resource-impoverished edge devices. We first introduce a new sound classification architecture, ACDNet, that achieves above state-of-the-art accuracy on both ESC-10 and ESC-50: 96.75% and 87.05%, respectively. We then compress ACDNet using a novel network-independent approach to obtain an extremely small model. Despite a 97.22% reduction in size and a 97.28% reduction in FLOPs, the compressed network still achieves 82.90% accuracy on ESC-50, staying close to the state of the art. Using 8-bit quantization, we deploy ACDNet on standard microcontroller units (MCUs). To the best of our knowledge, this is the first time that a deep network for sound classification of 50 classes has been successfully deployed on an edge device. While this should be of interest in its own right, we believe it to be of particular importance that this has been achieved with a universal conversion pipeline rather than by hand-crafting a network for minimal size.
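
The abstract does not say which 8-bit quantization scheme is used, so the following is only a minimal sketch of the general idea, assuming a symmetric per-tensor scheme. The function names `quantize_int8` and `dequantize` and the sample weights are hypothetical and are not taken from the paper. Each float weight is mapped to a signed 8-bit integer through a single scale factor, which cuts storage per weight from 32 bits to 8 at the cost of a bounded rounding error.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to the range [-127, 127].

    Assumption: one scale for the whole tensor, derived from the largest
    absolute weight. This is an illustrative scheme, not the paper's.
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the 8-bit integers."""
    return [v * scale for v in q]


# Hypothetical weights, chosen only to illustrate the mapping.
weights = [0.42, -1.27, 0.003, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# For this scheme the rounding error per weight is at most half a step.
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_err)
```

Under this assumed scheme, weights much smaller than the step size (such as 0.003 here) collapse to zero, which is one reason accuracy is normally re-checked after quantization.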

updated: Mon Mar 22 2021 00:07:25 GMT+0000 (UTC)

published: Fri Mar 05 2021 05:52:31 GMT+0000 (UTC)

arXiv
