Anchor DETR: Query Design for Transformer-Based Object Detection

Yingming Wang; Xiangyu Zhang; Tong Yang; Jian Sun

In this paper, we propose a novel query design for transformer-based object detection. In previous transformer-based detectors, the object queries are a set of learned embeddings. However, each learned embedding has no explicit physical meaning, and we cannot explain where it will focus. Optimization is also difficult because the prediction slot of each object query does not have a specific mode; in other words, each object query does not focus on a specific region. To solve these problems, our query design bases the object queries on anchor points, which are widely used in CNN-based detectors, so each object query focuses on the objects near its anchor point. Moreover, our query design can predict multiple objects at one position, which addresses the "one region, multiple objects" difficulty. In addition, we design an attention variant that reduces the memory cost while achieving similar or better performance than the standard attention in DETR. Thanks to the query design and the attention variant, the proposed detector, which we call Anchor DETR, achieves better performance and runs faster than DETR with 10× fewer training epochs. For example, it achieves 44.2 AP at 19 FPS on the MSCOCO dataset when using the ResNet50-DC5 feature and training for 50 epochs. Extensive experiments on the MSCOCO benchmark demonstrate the effectiveness of the proposed methods. Code is available at https://github.com/megvii-research/AnchorDETR.
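
The idea of anchor-based queries that can predict several objects at one position can be illustrated with a minimal sketch. This is not the authors' implementation: the uniform grid, the sinusoidal encoding, the random "pattern" vectors standing in for learned embeddings, and every function name below are assumptions chosen for illustration. The sketch builds one query per (anchor point, pattern) pair, so each anchor position owns several queries.

```python
import math
import random


def grid_anchor_points(rows, cols):
    """Uniformly spaced (x, y) anchor points in the unit square (assumed layout)."""
    return [((c + 0.5) / cols, (r + 0.5) / rows)
            for r in range(rows) for c in range(cols)]


def sine_encoding(value, dim, temperature=10000.0):
    """Sinusoidal encoding of a scalar in [0, 1] into `dim` features (dim must be even)."""
    feats = []
    for i in range(dim // 2):
        freq = temperature ** (2 * i / dim)
        angle = 2 * math.pi * value / freq
        feats.extend([math.sin(angle), math.cos(angle)])
    return feats


def build_queries(anchors, patterns, dim):
    """One query per (anchor, pattern) pair: position encoding plus pattern vector."""
    queries = []
    for x, y in anchors:
        # Half of the channels encode x, the other half encode y.
        pos = sine_encoding(x, dim // 2) + sine_encoding(y, dim // 2)
        for pattern in patterns:
            queries.append([p + e for p, e in zip(pos, pattern)])
    return queries


random.seed(0)
DIM = 8
anchors = grid_anchor_points(rows=2, cols=3)  # 6 anchor points
# Random vectors stand in for embeddings that would be learned in training.
patterns = [[random.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(3)]
queries = build_queries(anchors, patterns, DIM)
print(len(queries), len(queries[0]))  # 18 8
```

With 6 anchor points and 3 patterns, the sketch yields 18 queries. The three queries at a given anchor share the same position encoding and differ only in their pattern vector, which is one way to let a single position account for more than one object.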

updated: Tue Jan 04 2022 08:20:42 GMT+0000 (UTC)

published: Wed Sep 15 2021 06:31:55 GMT+0000 (UTC)

arXiv
