Channel-wise Knowledge Distillation for Dense Prediction

Changyong Shu; Yifan Liu; Jianfei Gao; Lin Xu; Chunhua Shen

Knowledge distillation (KD) has proven to be a simple and effective tool for training compact models. Almost all KD variants for dense prediction tasks align the student and teacher networks' feature maps in the spatial domain, typically by minimizing point-wise and/or pair-wise discrepancies. Observing that in semantic segmentation the feature activations of each channel in some layers tend to encode the saliency of scene categories (analogous to class activation mapping), we propose to align features channel-wise between the student and teacher networks. To this end, we first transform the feature map of each channel into a probability map using softmax normalization, and then minimize the Kullback-Leibler (KL) divergence between the corresponding channels of the two networks. By doing so, our method focuses on mimicking the soft distributions of channels between networks. In particular, the KL divergence enables learning to pay more attention to the most salient regions of the channel-wise maps, presumably corresponding to the most useful signals for semantic segmentation. Experiments demonstrate that our channel-wise distillation considerably outperforms almost all existing spatial distillation methods for semantic segmentation, and requires less computational cost during training. We consistently achieve superior performance on three benchmarks with various network structures. Code is available at: https://git.io/ChannelDis
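The two steps described in the abstract (a softmax over the spatial positions of each channel, then a KL divergence between corresponding teacher and student channels) can be illustrated with a minimal sketch. This is not the authors' released code. The function names, the `temperature` parameter, the averaging over channels, and the KL direction (teacher distribution as the reference) are assumptions made for illustration, and plain Python lists stand in for feature tensors; a real training setup would more likely use a tensor library.

```python
import math


def log_spatial_softmax(channel, temperature=1.0):
    """Log-probabilities of a softmax taken over one channel's spatial positions.

    `channel` is a flat list of the H*W activations of a single channel.
    `temperature` is an assumed, hypothetical knob; 1.0 is a plain softmax.
    """
    scaled = [v / temperature for v in channel]
    peak = max(scaled)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in scaled))
    return [v - log_norm for v in scaled]


def channel_wise_kd_loss(teacher, student, temperature=1.0):
    """Mean over channels of KL(teacher_channel || student_channel).

    `teacher` and `student` are lists of channels with matching shapes.
    Averaging over channels is an assumption of this sketch.
    """
    if len(teacher) != len(student):
        raise ValueError("teacher and student need the same number of channels")
    total = 0.0
    for t_ch, s_ch in zip(teacher, student):
        if len(t_ch) != len(s_ch):
            raise ValueError("channels need the same number of spatial positions")
        log_p_t = log_spatial_softmax(t_ch, temperature)
        log_p_s = log_spatial_softmax(s_ch, temperature)
        # Each position is weighted by the teacher probability, so positions
        # where the teacher channel is most active dominate the sum.
        total += sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_p_t, log_p_s))
    return total / len(teacher)


# Toy example: 2 channels, 4 spatial positions each (values are made up).
teacher_maps = [[0.0, 1.0, 4.0, 1.0], [2.0, 2.0, 0.0, 0.0]]
student_maps = [[0.0, 0.5, 2.0, 0.5], [1.0, 1.0, 1.0, 1.0]]
print(channel_wise_kd_loss(teacher_maps, student_maps))
```

Because each term is weighted by the teacher's probability at that position, low-probability (non-salient) positions contribute little to the loss, which matches the abstract's remark that the KL divergence emphasizes the most salient regions of each channel map.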

updated: Sat Jul 31 2021 08:24:08 GMT+0000 (UTC)

published: Thu Nov 26 2020 12:00:38 GMT+0000 (UTC)

arXiv
