UP-DETR: Unsupervised Pre-training for Object Detection with Transformers

Zhigang Dai; Bolun Cai; Yugeng Lin; Junying Chen

DEtection TRansformer (DETR) reaches performance competitive with Faster R-CNN on object detection via a transformer encoder-decoder architecture. However, when its transformers are trained from scratch, DETR needs large-scale training data and an extremely long training schedule, even on the COCO dataset. Inspired by the great success of pre-training transformers in natural language processing, we propose a novel pretext task named random query patch detection in Unsupervised Pre-training DETR (UP-DETR). Specifically, we randomly crop patches from the given image and feed them as queries to the decoder. The model is pre-trained to detect these query patches in the input image. During pre-training, we address two critical issues: multi-task learning and multi-query localization. (1) To trade off classification and localization preferences in the pretext task, we find that freezing the CNN backbone is a prerequisite for successfully pre-training the transformers. (2) To perform multi-query localization, we develop UP-DETR with multi-query patch detection using an attention mask. In addition, UP-DETR provides a unified perspective for fine-tuning on object detection and one-shot detection tasks. In our experiments, UP-DETR significantly boosts the performance of DETR, with faster convergence and higher average precision on object detection, one-shot detection and panoptic segmentation. Code and pre-trained models: https://github.com/dddzg/up-detr.
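The pretext task described above (crop random patches, then ask the model to localize them in the original image) can be sketched roughly as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function names, the patch size range (`min_frac`, `max_frac`), the normalized (cx, cy, w, h) box format, and the block-wise layout of the attention mask are all assumptions made for this example. The abstract states only that patches are cropped at random and that an attention mask is used for multi-query patch detection.

```python
import random


def sample_query_boxes(width, height, num_patches, rng, min_frac=0.1, max_frac=0.5):
    """Sample random patch boxes as (left, top, patch_w, patch_h) in pixels.

    The size range (min_frac, max_frac) is a hypothetical choice for this sketch.
    """
    boxes = []
    for _ in range(num_patches):
        patch_w = max(1, int(width * rng.uniform(min_frac, max_frac)))
        patch_h = max(1, int(height * rng.uniform(min_frac, max_frac)))
        # randint is inclusive on both ends, so the patch always fits in the image.
        left = rng.randint(0, width - patch_w)
        top = rng.randint(0, height - patch_h)
        boxes.append((left, top, patch_w, patch_h))
    return boxes


def crop(image, box):
    """Crop a patch from an image stored as a list of rows."""
    left, top, patch_w, patch_h = box
    return [row[left:left + patch_w] for row in image[top:top + patch_h]]


def normalize_box(box, width, height):
    """Convert a pixel box to a normalized (cx, cy, w, h) regression target."""
    left, top, patch_w, patch_h = box
    return (
        (left + patch_w / 2) / width,
        (top + patch_h / 2) / height,
        patch_w / width,
        patch_h / height,
    )


def group_attention_mask(num_queries, num_patches):
    """Build an additive self-attention mask over decoder queries.

    Assumption for this sketch: queries are split into equal groups, one group
    per query patch, and a query may only attend to queries in its own group
    (0.0 = allowed, -inf = blocked).
    """
    if num_queries % num_patches != 0:
        raise ValueError("num_queries must be divisible by num_patches")
    group_size = num_queries // num_patches
    neg_inf = float("-inf")
    return [
        [0.0 if i // group_size == j // group_size else neg_inf
         for j in range(num_queries)]
        for i in range(num_queries)
    ]


rng = random.Random(0)
width, height = 96, 64
# Toy "image": each pixel stores its own (x, y) coordinate.
image = [[(x, y) for x in range(width)] for y in range(height)]

boxes = sample_query_boxes(width, height, num_patches=3, rng=rng)
patches = [crop(image, box) for box in boxes]                    # decoder query inputs
targets = [normalize_box(box, width, height) for box in boxes]   # boxes to predict
mask = group_attention_mask(num_queries=6, num_patches=3)
```

Because the patches come from the image itself, the boxes serve as free localization labels, which is what makes the pre-training unsupervised. In a real pipeline the cropped patches would be encoded by the CNN backbone before being added to the decoder queries.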

updated: Mon Jul 24 2023 11:28:46 GMT+0000 (UTC)

published: Wed Nov 18 2020 05:16:11 GMT+0000 (UTC)

arXiv
