GDP: Stabilized Neural Network Pruning via Gates with Differentiable Polarization

Yi Guo; Huan Yuan; Jianchao Tan; Zhangyang Wang; Sen Yang; Ji Liu

Model compression techniques have recently attracted intense attention as a way to obtain efficient AI models for real-time applications. Channel pruning is one important compression strategy and is widely used to slim various DNNs. Previous gate-based or importance-based pruning methods aim to remove the channels with the smallest importance. However, it remains unclear which criteria channel importance should be measured by, which has led to a variety of channel-selection heuristics. Other, sampling-based pruning methods use sampling strategies to train sub-nets, which often causes training instability and degrades the compressed model's performance. To address these research gaps, we present a new module named Gates with Differentiable Polarization (GDP), inspired by principled optimization ideas. GDP can be plugged in before convolutional layers without bells and whistles to switch each channel, or a whole layer block, on or off. During training, the polarization effect drives a subset of gates to decrease smoothly to exactly zero, while the other gates gradually move away from zero by a large margin. When training terminates, the zero-gated channels can be painlessly removed, and the remaining non-zero gates can be absorbed into the succeeding convolution kernel, causing no interruption to training and no damage to the trained model. Experiments on the CIFAR-10 and ImageNet datasets show that the proposed GDP algorithm achieves state-of-the-art performance on various benchmark DNNs across a broad range of pruning ratios. We also apply GDP to DeepLabV3Plus-ResNet50 on the challenging Pascal VOC segmentation task, where test performance shows no drop (and even improves slightly) with over 60% of FLOPs saved.
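The absorption step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a per-channel scalar gate placed in front of a 1x1 convolution, written in pure Python, and every name in it (conv1x1, apply_gates, absorb_gates) is hypothetical. It only shows why a zero gate lets an input channel be dropped and why a non-zero gate can be folded into the succeeding kernel without changing the output.

```python
# Illustrative sketch only; all function names are hypothetical, not from the paper.
# Assumption: one scalar gate per input channel, followed by a 1x1 convolution.

def conv1x1(x, w):
    """x: [C_in][N] activations, w: [C_out][C_in] kernel -> [C_out][N]."""
    n = len(x[0])
    return [[sum(w[o][c] * x[c][i] for c in range(len(x))) for i in range(n)]
            for o in range(len(w))]

def apply_gates(x, gates):
    """Scale each input channel by its gate value (the training-time view)."""
    return [[g * v for v in ch] for g, ch in zip(gates, x)]

def absorb_gates(w, gates):
    """Fold non-zero gates into the kernel and drop zero-gated input channels."""
    keep = [c for c, g in enumerate(gates) if g != 0.0]
    w_pruned = [[row[c] * gates[c] for c in keep] for row in w]
    return w_pruned, keep

x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # 3 input channels, 2 positions
w = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.5]]  # 2 output channels
gates = [0.0, 2.0, 0.5]                    # channel 0 has been driven to exact zero

# Output of the gated layer as it would run during training.
gated_out = conv1x1(apply_gates(x, gates), w)

# Output of the pruned layer: fewer input channels, gates folded into the kernel.
w_pruned, keep = absorb_gates(w, gates)
pruned_out = conv1x1([x[c] for c in keep], w_pruned)
```

Here gated_out and pruned_out agree, while the pruned kernel has two input channels instead of three. In a full network, the layer that produces the removed channel would lose the corresponding output filter as well, which this sketch does not cover.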

updated: Wed Sep 08 2021 07:51:17 GMT+0000 (UTC)

published: Mon Sep 06 2021 03:17:10 GMT+0000 (UTC)

arXiv
