Cumulative Spatial Knowledge Distillation for Vision Transformers

Borui Zhao; Renjie Song; Jiajun Liang

Distilling knowledge from convolutional neural networks (CNNs) is a double-edged sword for vision transformers (ViTs). It boosts performance, since the image-friendly local inductive bias of CNNs helps ViTs learn faster and better, but it also leads to two problems: (1) The network designs of CNNs and ViTs are completely different, so their intermediate features sit at different semantic levels, which makes spatial-wise knowledge transfer methods (e.g., feature mimicking) inefficient. (2) Distilling knowledge from a CNN limits network convergence in the later training period, since the ViT's capability of integrating global information is suppressed by the CNN's local-inductive-bias supervision. To address these problems, we present Cumulative Spatial Knowledge Distillation (CSKD). CSKD distills spatial-wise knowledge to all patch tokens of the ViT from the corresponding spatial responses of the CNN, without introducing intermediate features. Furthermore, CSKD exploits a Cumulative Knowledge Fusion (CKF) module, which introduces the global response of the CNN and increasingly emphasizes its importance during training. Applying CKF leverages the CNN's local inductive bias in the early training period and gives full play to the ViT's global capability in the later one. Extensive experiments and analysis on ImageNet-1k and downstream datasets demonstrate the superiority of CSKD. Code will be publicly available.
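The abstract gives no formulas, so the following is only a rough sketch of how the two ideas might fit together, not the authors' implementation. It assumes that the CNN yields one class-logit vector per spatial location (matched one-to-one with ViT patch tokens) plus one global class-logit vector, that CKF blends the two as a convex combination, and that the weight on the global response grows linearly with training progress. The function names, the linear schedule, and the soft-target cross-entropy loss are all hypothetical choices made for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def softmax(logits):
    return [math.exp(v) for v in log_softmax(logits)]


def fusion_weight(step, total_steps):
    """Assumed schedule: weight on the global response rises linearly from 0 to 1."""
    return min(max(step / total_steps, 0.0), 1.0)


def fuse_targets(local_logits, global_logits, alpha):
    """Assumed CKF form: convex mix of local and global teacher distributions."""
    p_local = softmax(local_logits)
    p_global = softmax(global_logits)
    return [alpha * g + (1.0 - alpha) * l for l, g in zip(p_local, p_global)]


def patch_distill_loss(student_patch_logits, teacher_local_logits,
                       teacher_global_logits, step, total_steps):
    """Mean soft-target cross-entropy over all patch tokens.

    student_patch_logits: one logit vector per ViT patch token.
    teacher_local_logits: one logit vector per matching CNN spatial location.
    teacher_global_logits: a single logit vector from the CNN's global response.
    """
    alpha = fusion_weight(step, total_steps)
    total = 0.0
    for s_logits, t_local in zip(student_patch_logits, teacher_local_logits):
        target = fuse_targets(t_local, teacher_global_logits, alpha)
        log_probs = log_softmax(s_logits)
        total -= sum(t * lp for t, lp in zip(target, log_probs))
    return total / len(student_patch_logits)
```

Under these assumptions, early steps supervise each patch token almost entirely with its own local teacher response, while late steps shift the target toward the teacher's global response, which mirrors the early-local, late-global behavior the abstract describes.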

updated: Mon Jul 17 2023 14:03:45 GMT+0000 (UTC)

published: Mon Jul 17 2023 14:03:45 GMT+0000 (UTC)

arXiv
