Box-DETR: Understanding and Boxing Conditional Spatial Queries

Wenze Liu; Hao Lu; Yuliang Liu; Zhiguo Cao

Conditional spatial queries were recently introduced into the DEtection TRansformer (DETR) to accelerate convergence. In DAB-DETR, such queries are modulated by the so-called conditional linear projection at each decoder stage, with the aim of searching for positions of interest such as the four extremities of the box. Each decoder stage progressively updates the box by predicting offsets to the anchor box, yet cross-attention receives only the box center as its reference point. Using only the box center, however, leaves the width and height of the previous box unknown to the current stage, which hinders accurate offset prediction. We argue that explicitly using the entire box information in cross-attention matters. In this work, we propose Box Agent, which condenses the box into head-specific agent points. By replacing the box center with the agent point as the reference point in each head, the conditional cross-attention can search for positions from a more reasonable starting point that accounts for the full scope of the previous box, rather than always starting from the previous box center. This significantly reduces the burden on the conditional linear projection. Experimental results show that Box Agent leads to not only faster convergence but also improved detection performance; e.g., our single-scale model achieves 44.2 AP with ResNet-50 based on DAB-DETR. Box Agent requires only minor modifications to the code and adds negligible computational workload. Code is available at https://github.com/tiny-smart/box-detr.
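
The idea of condensing a box into head-specific agent points can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything below is an assumption made for illustration: the function name `box_agent_points`, the per-head raw offsets `head_offsets` (standing in for learnable parameters), and the choice of `tanh` to keep each point inside the box are hypothetical and may differ from the released code.

```python
import math


def box_agent_points(box, head_offsets):
    """Map one box to one reference point per attention head.

    box: (cx, cy, w, h) in normalized image coordinates.
    head_offsets: list of raw (dx, dy) pairs, one per head.
        These are hypothetical stand-ins for learnable per-head parameters.
    """
    cx, cy, w, h = box
    points = []
    for dx, dy in head_offsets:
        # Assumption: tanh squashes each raw offset into (-1, 1), so the
        # agent point moves at most half a width/height from the center
        # and therefore stays inside the box.
        px = cx + 0.5 * w * math.tanh(dx)
        py = cy + 0.5 * h * math.tanh(dy)
        points.append((px, py))
    return points


# One box and five heads: one head stays at the center, the others
# lean toward the left, right, top, and bottom sides of the box.
box = (0.5, 0.5, 0.2, 0.4)
offsets = [(0.0, 0.0), (-3.0, 0.0), (3.0, 0.0), (0.0, -3.0), (0.0, 3.0)]
agents = box_agent_points(box, offsets)
```

With a zero offset a head falls back to the box center, which corresponds to the center-only reference point the abstract attributes to DAB-DETR; nonzero offsets let different heads start their search nearer to different sides of the previous box, using its width and height.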

updated: Mon Jul 17 2023 09:45:19 GMT+0000 (UTC)

published: Mon Jul 17 2023 09:45:19 GMT+0000 (UTC)

arXiv
