DETR Doesn't Need Multi-Scale or Locality Design

Yutong Lin; Yuhui Yuan; Zheng Zhang; Chen Li; Nanning Zheng; Han Hu

This paper presents an improved DETR detector that keeps a "plain" design: it uses a single-scale feature map and global cross-attention without specific locality constraints, in contrast to previous leading DETR-based detectors, which reintroduce the architectural inductive biases of multi-scale features and locality into the decoder. We show that two simple techniques are surprisingly effective within a plain design at compensating for the lack of multi-scale feature maps and locality constraints. The first is a box-to-pixel relative position bias (BoxRPB) term added to the cross-attention formulation, which guides each query to attend to its corresponding object region while also providing encoding flexibility. The second is masked image modeling (MIM)-based backbone pre-training, which helps learn representations with fine-grained localization ability and proves crucial for reducing the dependence on multi-scale feature maps. By incorporating these techniques together with recent advances in training and problem formulation, the improved "plain" DETR shows exceptional improvements over the original DETR detector. With pre-training on the Objects365 dataset, it achieves 63.9 mAP using a Swin-L backbone, which is highly competitive with state-of-the-art detectors that all rely heavily on multi-scale feature maps and region-based feature extraction. Code is available at https://github.com/impiga/Plain-DETR.
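To make the BoxRPB idea concrete, the following is a minimal NumPy sketch of a box-dependent bias added to global cross-attention logits for a single query. It is an illustration under stated assumptions, not the authors' implementation: the abstract only says that a box-to-pixel relative position bias is added to the cross-attention formulation. The function names (`box_rpb_bias`, `toy_axis_bias`, `cross_attention_with_bias`), the split into an x term plus a y term, and the hand-written penalty standing in for whatever learned mapping the method uses are all hypothetical choices made for this example.

```python
import numpy as np


def toy_axis_bias(offsets, scale=1.0):
    """Hypothetical stand-in for a learned bias function along one axis.

    offsets[..., 0] is (pixel - box_min) and offsets[..., 1] is (pixel - box_max).
    Returns 0 for positions inside [box_min, box_max] and a negative value
    proportional to the distance outside that interval.
    """
    to_min, to_max = offsets[..., 0], offsets[..., 1]
    outside = np.maximum(-to_min, 0.0) + np.maximum(to_max, 0.0)
    return -scale * outside


def box_rpb_bias(box, height, width, axis_bias=toy_axis_bias):
    """Build an (height, width) bias map from a box (x1, y1, x2, y2).

    Coordinates are in feature-map pixel units. The bias depends only on
    the offsets between each pixel center and the two box corners.
    """
    x1, y1, x2, y2 = box
    xs = np.arange(width) + 0.5   # pixel-center x coordinates
    ys = np.arange(height) + 0.5  # pixel-center y coordinates
    dx = np.stack([xs - x1, xs - x2], axis=-1)  # shape (width, 2)
    dy = np.stack([ys - y1, ys - y2], axis=-1)  # shape (height, 2)
    # Assumed decomposition: a per-column term plus a per-row term.
    return axis_bias(dy)[:, None] + axis_bias(dx)[None, :]


def cross_attention_with_bias(query, keys, values, bias):
    """Single-query global cross-attention: softmax(q.k / sqrt(d) + bias) @ values."""
    d = query.shape[-1]
    logits = keys @ query / np.sqrt(d) + bias.reshape(-1)
    logits = logits - logits.max()  # subtract the max for numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum()
    return weights @ values, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    height, width, dim = 6, 8, 4
    keys = rng.normal(size=(height * width, dim))
    values = rng.normal(size=(height * width, dim))
    query = np.zeros(dim)  # zero query, so the attention pattern comes from the bias alone
    bias = box_rpb_bias((2.0, 1.0, 5.0, 4.0), height, width)
    output, weights = cross_attention_with_bias(query, keys, values, bias)
    print(weights.reshape(height, width).round(3))
```

The attention still spans every pixel of the single-scale feature map, so no hard locality constraint is imposed; the bias only shifts the softmax toward the region covered by the query's box.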

updated: Thu Aug 03 2023 17:59:04 GMT+0000 (UTC)

published: Thu Aug 03 2023 17:59:04 GMT+0000 (UTC)

arXiv
