Sparse DETR: Efficient End-to-End Object Detection with Learnable Sparsity

Byungseok Roh; JaeWoong Shin; Wuhyun Shin; Saehoon Kim

DETR is the first end-to-end object detector built on a transformer encoder-decoder architecture. It achieves competitive performance but is computationally inefficient on high-resolution feature maps. The subsequent work, Deformable DETR, improves the efficiency of DETR by replacing dense attention with deformable attention, achieving 10x faster convergence and better performance. Deformable DETR uses multi-scale features to improve performance; however, the number of encoder tokens increases by 20x compared to DETR, and the computation cost of encoder attention remains a bottleneck. In our preliminary experiment, we observe that detection performance hardly deteriorates even if only a subset of the encoder tokens is updated. Inspired by this observation, we propose Sparse DETR, which selectively updates only the tokens expected to be referenced by the decoder, thereby helping the model detect objects effectively. In addition, we show that applying an auxiliary detection loss on the selected tokens in the encoder improves performance while minimizing computational overhead. We validate that Sparse DETR achieves better performance than Deformable DETR even with only 10% of the encoder tokens on the COCO dataset. Although only the encoder tokens are sparsified, the total computation cost decreases by 38% and the frames per second (FPS) increases by 42% compared to Deformable DETR. Code is available at https://github.com/kakaobrain/sparse-detr
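
The selective update described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a per-token saliency score is already available, and the names `select_salient`, `sparse_encoder_layer`, `keep_ratio`, and `update` are hypothetical. The `update` callable stands in for whatever an encoder layer would do to a single token.

```python
from typing import Callable, List, Sequence

Token = List[float]


def select_salient(scores: Sequence[float], keep_ratio: float) -> List[int]:
    """Return the indices of the highest-scoring tokens, in original order.

    `scores` is an assumed per-token saliency score; how it is produced
    is outside the scope of this sketch.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    # Keep at least one token so the layer always updates something.
    k = max(1, int(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def sparse_encoder_layer(
    tokens: Sequence[Token],
    scores: Sequence[float],
    keep_ratio: float,
    update: Callable[[Token], Token],
) -> List[Token]:
    """Apply `update` only to the selected tokens; copy the rest unchanged."""
    keep = select_salient(scores, keep_ratio)
    out = [list(t) for t in tokens]  # unselected tokens pass through as-is
    for i in keep:
        out[i] = update(tokens[i])
    return out


if __name__ == "__main__":
    toks = [[1.0], [2.0], [3.0], [4.0]]
    sal = [0.1, 0.9, 0.3, 0.8]
    # Hypothetical update: double every feature of a selected token.
    print(sparse_encoder_layer(toks, sal, 0.5, lambda t: [2 * x for x in t]))
```

With `keep_ratio=0.5`, only the two highest-scoring tokens are passed to `update`, so the cost of the update step scales with the number of selected tokens rather than with the full token count.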

updated: Mon Nov 29 2021 05:22:46 GMT+0000 (UTC)

published: Mon Nov 29 2021 05:22:46 GMT+0000 (UTC)

arXiv
