Shifting Capsule Networks from the Cloud to the Deep Edge

Miguel Costa; Diogo Costa; Tiago Gomes; Sandro Pinto

Capsule networks (CapsNets) are an emerging trend in image processing. In contrast to convolutional neural networks, CapsNets are not vulnerable to object deformation, because the relative spatial information of objects is preserved across the network. However, their complexity, which stems mainly from the capsule structure and the dynamic routing mechanism, makes it almost impractical to deploy a CapsNet in its original form on a resource-constrained device powered by a small microcontroller (MCU). In an era where intelligence is rapidly shifting from the cloud to the edge, this high complexity poses serious challenges to the adoption of CapsNets at the very edge. To tackle this issue, we present an API for the execution of quantized CapsNets on Arm Cortex-M and RISC-V MCUs. Our software kernels extend Arm CMSIS-NN and RISC-V PULP-NN to support capsule operations with 8-bit integers as operands. In addition, we propose a framework to perform post-training quantization of a CapsNet. Results show a reduction in memory footprint of almost 75%, with accuracy loss ranging from 0.07% to 0.18%. In terms of throughput, our Arm Cortex-M API enables the execution of the primary capsule and capsule layers with medium-sized kernels in just 119.94 and 90.60 milliseconds (ms), respectively (STM32H755ZIT6U, Cortex-M7 @ 480 MHz). On the GAP-8 SoC (RISC-V RV32IMCXpulp @ 170 MHz), the latency drops to 7.02 and 38.03 ms, respectively.
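The abstract does not state which 8-bit scheme the quantization framework uses, so the following is only a minimal sketch of post-training weight quantization, assuming symmetric, per-tensor, power-of-two (Q-format) scaling. This is in the spirit of the shift-based fixed-point `q7_t` arithmetic of the legacy CMSIS-NN kernels, but it is not the authors' method. The helpers `choose_frac_bits`, `quantize_q7` and `dequantize_q7` and the sample weights are hypothetical. The sketch also shows where a figure near 75% comes from: each value shrinks from a 4-byte float to a 1-byte integer.

```python
import math
from array import array


def choose_frac_bits(weights):
    """Hypothetical helper: pick fractional bits n so max|w| fits a signed Q(7-n).n byte.

    Assumes one power-of-two scale (a single shift) per tensor.
    """
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return 7
    int_bits = math.ceil(math.log2(max_abs))  # negative for small weights -> more precision
    return 7 - int_bits


def round_half_away(x):
    """Round to nearest integer, ties away from zero (Python's round() uses ties-to-even)."""
    return int(math.copysign(math.floor(abs(x) + 0.5), x))


def quantize_q7(weights, frac_bits):
    """Hypothetical helper: scale by 2**frac_bits, round, and saturate to int8 [-128, 127]."""
    scale = 2.0 ** frac_bits
    return [max(-128, min(127, round_half_away(w * scale))) for w in weights]


def dequantize_q7(q, frac_bits):
    """Map int8 codes back to real values (only needed to measure quantization error)."""
    return [v / 2.0 ** frac_bits for v in q]


# Illustrative weights only, not taken from the paper.
weights = [0.5, -0.25, 0.124, -0.9, 0.03, 0.7]
n = choose_frac_bits(weights)
q = quantize_q7(weights, n)
restored = dequantize_q7(q, n)

fp32_bytes = array("f", weights).itemsize * len(weights)  # 4 bytes per float32 value
int8_bytes = array("b", q).itemsize * len(q)              # 1 byte per int8 value

print("frac bits:", n, "codes:", q)
print(f"memory reduction: {100 * (1 - int8_bytes / fp32_bytes):.1f}%")
print("max abs error:", max(abs(a - b) for a, b in zip(weights, restored)))
```

The ideal reduction for storage converted from 32-bit floats to 8-bit integers is exactly 75%. Any data kept at higher precision would keep the real figure slightly below that.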

updated: Wed Jun 15 2022 10:41:49 GMT+0000 (UTC)

published: Wed Oct 06 2021 16:52:01 GMT+0000 (UTC)

arXiv