Self-Attentive Pooling for Efficient Deep Learning

Fang Chen; Gourav Datta; Souvik Kundu; Peter Beerel

Efficient custom pooling techniques that aggressively trim the dimensions of a feature map, and thereby reduce inference compute and memory footprint for resource-constrained computer vision applications, have recently gained significant traction. However, prior pooling works extract only the local context of the activation maps, which limits their effectiveness. In contrast, we propose a novel non-local self-attentive pooling method that can be used as a drop-in replacement for standard pooling layers such as max/average pooling or strided convolution. The proposed self-attention module uses patch embedding, multi-head self-attention, and spatial-channel restoration, followed by sigmoid activation and exponential softmax. This self-attention mechanism efficiently aggregates dependencies between non-local activation patches during down-sampling. Extensive experiments on standard object classification and detection tasks with various convolutional neural network (CNN) architectures demonstrate the superiority of our proposed mechanism over state-of-the-art (SOTA) pooling techniques. In particular, we surpass the test accuracy of existing pooling techniques on different variants of MobileNet-V2 on ImageNet by an average of 1.2%. With aggressive down-sampling of the activation maps in the initial layers (providing up to 22x reduction in memory consumption), our approach achieves 1.43% higher test accuracy than SOTA techniques with iso-memory footprints. Because the initial activation maps consume a significant amount of on-chip memory for the high-resolution images required by complex vision tasks, this reduction enables the deployment of our models on memory-constrained devices such as micro-controllers without significant loss of accuracy. Our proposed pooling method also leverages the idea of channel pruning to further reduce memory footprints.
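The pipeline named in the abstract (patch embedding, self-attention, spatial restoration, sigmoid, exponential softmax) can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a single channel, a single attention head, identity query/key/value projections in place of learned ones, and map sizes divisible by the patch and window sizes. The function names (`patch_tokens`, `self_attention`, `restore`, `self_attentive_pool`) and the parameters `p` and `k` are hypothetical, chosen only for this example.

```python
import math

def softmax(xs):
    # Numerically stable softmax: exp(x_i) / sum_j exp(x_j).
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def sigmoid(x):
    # Stable logistic function for both signs of x.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def patch_tokens(fmap, p):
    # Patch embedding (toy): flatten each non-overlapping p x p patch into a token.
    H, W = len(fmap), len(fmap[0])
    return [[fmap[i + di][j + dj] for di in range(p) for dj in range(p)]
            for i in range(0, H, p) for j in range(0, W, p)]

def self_attention(tokens):
    # Single-head scaled dot-product attention with identity Q/K/V (assumption).
    # Every patch attends to every other patch, which is the non-local part.
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, key)) / math.sqrt(d) for key in tokens]
        w = softmax(scores)
        out.append([sum(wi * v[c] for wi, v in zip(w, tokens)) for c in range(d)])
    return out

def restore(tokens, H, W, p):
    # Spatial restoration: scatter tokens back onto an H x W grid.
    fmap = [[0.0] * W for _ in range(H)]
    idx = 0
    for i in range(0, H, p):
        for j in range(0, W, p):
            for di in range(p):
                for dj in range(p):
                    fmap[i + di][j + dj] = tokens[idx][di * p + dj]
            idx += 1
    return fmap

def self_attentive_pool(fmap, p=2, k=2):
    # Down-sample an H x W map by k in each dimension using attention-derived weights.
    H, W = len(fmap), len(fmap[0])
    attended = restore(self_attention(patch_tokens(fmap, p)), H, W, p)
    gate = [[sigmoid(v) for v in row] for row in attended]
    pooled = []
    for i in range(0, H, k):
        row = []
        for j in range(0, W, k):
            cells = [(i + di, j + dj) for di in range(k) for dj in range(k)]
            # Exponential softmax over the gate values inside the pooling window.
            w = softmax([gate[a][b] for a, b in cells])
            row.append(sum(wi * fmap[a][b] for wi, (a, b) in zip(w, cells)))
        pooled.append(row)
    return pooled
```

Each output value is a convex combination of the activations in its window, so the layer behaves like a learned interpolation between average pooling (uniform weights) and max pooling (one dominant weight), with the weights informed by patches outside the window.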

updated: Mon Sep 19 2022 03:53:41 GMT+0000 (UTC)

published: Fri Sep 16 2022 00:35:14 GMT+0000 (UTC)

arXiv
