Knowledge Distillation for Detection Transformer with Consistent Distillation Points Sampling

Yu Wang; Xin Li; Shengzhao Wen; Fukui Yang; Wanping Zhang; Gang Zhang; Haocheng Feng; Junyu Han; Errui Ding

DETR is a novel end-to-end transformer-based object detector that significantly outperforms classic detectors when the model size is scaled up. In this paper, we focus on compressing DETR with knowledge distillation. While knowledge distillation has been well studied for classic detectors, there is little research on how to make it work effectively for DETR. We first provide experimental and theoretical analysis showing that the main challenge in DETR distillation is the lack of consistent distillation points. Distillation points are the inputs corresponding to the predictions that the student mimics, and reliable distillation requires sufficient distillation points that are consistent between teacher and student. Based on this observation, we propose a general knowledge distillation paradigm for DETR (KD-DETR) with consistent distillation point sampling. Specifically, we decouple the detection and distillation tasks by introducing a set of specialized object queries to construct distillation points. Within this paradigm, we further propose a general-to-specific distillation point sampling strategy to explore the extensibility of KD-DETR. Extensive experiments on different DETR architectures with backbones and transformer layers of various scales validate the effectiveness and generalization of KD-DETR. KD-DETR boosts the performance of DAB-DETR with ResNet-18 and ResNet-50 backbones to 41.4% and 45.7% mAP, respectively, which is 5.2% and 3.5% higher than the baselines, and the ResNet-50 model even surpasses the teacher model by 2.2%.
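The idea of consistent distillation points can be illustrated with a toy sketch. This is not the KD-DETR implementation: the linear "heads" below are hypothetical stand-ins for the teacher and student decoders, the KL-divergence loss on class probabilities is an assumed choice, and box regression and the general-to-specific sampling strategy are omitted. The only point shown is that one shared set of sampled queries is fed to both models, so every student prediction is compared with a teacher prediction for the same input.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def make_linear_head(dim, num_classes, rng):
    """Hypothetical stand-in for a decoder: maps one query to class logits."""
    weights = [[rng.uniform(-1, 1) for _ in range(dim)]
               for _ in range(num_classes)]

    def head(query):
        return [sum(w * x for w, x in zip(row, query)) for row in weights]

    return head


def sample_distillation_queries(num_queries, dim, rng):
    """Sample queries used only for distillation, separate from detection."""
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_queries)]


def distillation_loss(teacher, student, queries):
    """Mean KL divergence between teacher and student on the SAME queries."""
    losses = []
    for q in queries:
        p_teacher = softmax(teacher(q))
        p_student = softmax(student(q))
        losses.append(kl_divergence(p_teacher, p_student))
    return sum(losses) / len(losses)


rng = random.Random(0)
teacher = make_linear_head(dim=8, num_classes=4, rng=rng)
student = make_linear_head(dim=8, num_classes=4, rng=rng)

# One shared set of distillation points for both models.
shared_queries = sample_distillation_queries(16, 8, rng)

loss = distillation_loss(teacher, student, shared_queries)
# Sanity check: a model distilled against itself has zero loss.
self_loss = distillation_loss(teacher, teacher, shared_queries)
```

In a real training loop the student's parameters would be updated to reduce `loss`, alongside the usual detection loss computed on the student's own object queries.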

updated: Tue Nov 15 2022 11:52:30 GMT+0000 (UTC)

published: Tue Nov 15 2022 11:52:30 GMT+0000 (UTC)

arXiv
