Recurrent Glimpse-based Decoder for Detection with Transformer

Zhe Chen; Jing Zhang; Dacheng Tao

Although detection with Transformer (DETR) is increasingly popular, its global attention modeling requires an extremely long training period to optimize and achieve promising detection performance. Unlike existing studies that mainly develop advanced feature or embedding designs to tackle the training issue, we point out that Region-of-Interest (RoI) based detection refinement can readily mitigate the training difficulty of DETR methods. Based on this, we introduce a novel REcurrent Glimpse-based decOder (REGO). REGO employs a multi-stage recurrent processing structure that helps the attention of DETR gradually focus more accurately on foreground objects. In each processing stage, visual features are extracted as glimpse features from RoIs obtained by enlarging the bounding boxes detected in the previous stage. A glimpse-based decoder then produces refined detection results from both the glimpse features and the attention modeling outputs of the previous stage. In practice, REGO can be easily embedded in representative DETR variants while maintaining their fully end-to-end training and inference pipelines. Notably, REGO helps Deformable DETR achieve 44.8 AP on the MSCOCO dataset with only 36 training epochs, whereas the original DETR and Deformable DETR require 500 and 50 epochs, respectively, to achieve comparable performance. Experiments also show that REGO consistently boosts the performance of different DETR detectors, with a relative gain of up to 7% at the same setting of 50 training epochs. Code is available via https://github.com/zhechen/Deformable-DETR-REGO.
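
The multi-stage structure described in the abstract can be pictured with a small sketch. This is not the authors' implementation: the names `enlarge_box` and `recurrent_refine`, the `(x_min, y_min, x_max, y_max)` box format, the per-stage `scales`, and the `refine_step` callback (standing in for glimpse feature extraction plus the glimpse-based decoder) are all assumptions made for illustration. It only shows how boxes from one stage might be enlarged into RoIs and handed to the next stage.

```python
from typing import Callable, List, Sequence, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def enlarge_box(box: Box, scale: float, img_w: float, img_h: float) -> Box:
    """Scale a box about its center by `scale`, then clip it to the image."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    half_w = (x1 - x0) * scale / 2.0
    half_h = (y1 - y0) * scale / 2.0
    return (
        max(0.0, cx - half_w),
        max(0.0, cy - half_h),
        min(img_w, cx + half_w),
        min(img_h, cy + half_h),
    )


def recurrent_refine(
    boxes: List[Box],
    refine_step: Callable[[List[Box], List[Box]], List[Box]],
    scales: Sequence[float],
    img_w: float,
    img_h: float,
) -> List[Box]:
    """Run one refinement stage per entry in `scales`.

    `refine_step(rois, previous_boxes)` is a hypothetical placeholder for
    the model: it would read glimpse features inside `rois` and return
    refined boxes. Here it is supplied by the caller.
    """
    for scale in scales:
        # RoIs are the previous stage's boxes, enlarged to include context.
        rois = [enlarge_box(b, scale, img_w, img_h) for b in boxes]
        boxes = refine_step(rois, boxes)
    return boxes
```

For example, `enlarge_box((40, 40, 60, 60), 2.0, 100, 100)` doubles a 20-pixel box around its center and returns `(30.0, 30.0, 70.0, 70.0)`. The scale values per stage are free parameters in this sketch; the abstract does not state them.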

updated: Thu Dec 09 2021 00:29:19 GMT+0000 (UTC)

published: Thu Dec 09 2021 00:29:19 GMT+0000 (UTC)

arXiv
