LW-DETR: A Transformer Replacement to YOLO for Real-Time Detection

Qiang Chen; Xiangbo Su; Xinyu Zhang; Jian Wang; Jiahui Chen; Yunpeng Shen; Chuchu Han; Ziliang Chen; Weixiang Xu; Fanrong Li; Shan Zhang; Kun Yao; Errui Ding; Gang Zhang; Jingdong Wang

In this paper, we present a light-weight detection transformer, LW-DETR, which outperforms YOLOs for real-time object detection. The architecture is a simple stack of a ViT encoder, a projector, and a shallow DETR decoder. Our approach leverages recent advances: training-effective techniques, e.g., improved loss and pretraining, and interleaved window and global attention to reduce the complexity of the ViT encoder. We improve the ViT encoder by aggregating multi-level feature maps, namely the intermediate and final feature maps of the encoder, to form richer feature maps, and we introduce a window-major feature map organization to make interleaved attention computation more efficient. Experimental results demonstrate that the proposed approach is superior to existing real-time detectors, e.g., YOLO and its variants, on COCO and other benchmark datasets. Code and models are available at https://github.com/Atten4Vis/LW-DETR.
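The idea behind a window-major organization can be illustrated with a small sketch. It assumes that "window-major" means storing the tokens of each attention window contiguously, so that window attention can slice the sequence directly instead of permuting a row-major feature map at every layer. The function names and the plain-list token representation are hypothetical and are not taken from the LW-DETR code base.

```python
def window_major_order(height, width, window):
    """Return row-major token indices reordered so each window is contiguous.

    Assumption: windows are square, non-overlapping, and tile the grid exactly.
    """
    if height % window or width % window:
        raise ValueError("height and width must be divisible by window")
    order = []
    for wy in range(0, height, window):          # top row of each window
        for wx in range(0, width, window):       # left column of each window
            for y in range(wy, wy + window):     # rows inside the window
                for x in range(wx, wx + window): # columns inside the window
                    order.append(y * width + x)
    return order


def to_window_major(tokens, height, width, window):
    """Reorder a row-major token list into window-major layout."""
    order = window_major_order(height, width, window)
    return [tokens[i] for i in order]


def to_row_major(tokens, height, width, window):
    """Invert to_window_major, restoring the row-major layout."""
    order = window_major_order(height, width, window)
    restored = [None] * len(tokens)
    for position, index in enumerate(order):
        restored[index] = tokens[position]
    return restored


if __name__ == "__main__":
    grid = list(range(16))  # a 4x4 feature map, one integer per token
    print(to_window_major(grid, height=4, width=4, window=2))
```

In this layout, each consecutive run of `window * window` tokens forms one attention window, while global attention can treat the same sequence as a whole because it does not depend on token order beyond positional information.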

updated: Wed Jun 05 2024 17:07:24 GMT+0000 (UTC)

published: Wed Jun 05 2024 17:07:24 GMT+0000 (UTC)

arXiv
