Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers

Dahun Kim; Anelia Angelova; Weicheng Kuo

We present Region-aware Open-vocabulary Vision Transformers (RO-ViT), a contrastive image-text pretraining recipe that bridges the gap between image-level pretraining and open-vocabulary object detection. In the pretraining phase, we propose to randomly crop and resize regions of the positional embeddings instead of using the whole-image positional embeddings. This better matches how positional embeddings are used at the region level in the detection finetuning phase. In addition, we replace the common softmax cross-entropy loss in contrastive learning with focal loss to better learn from informative yet difficult examples. Finally, we leverage recent advances in novel object proposals to improve open-vocabulary detection finetuning. We evaluate our full model on the LVIS and COCO open-vocabulary detection benchmarks and on zero-shot transfer. RO-ViT achieves a state-of-the-art 32.1 AP_r on LVIS, surpassing the best existing approach by 5.8 points, and is competitive in zero-shot transfer detection. Surprisingly, RO-ViT also improves the image-level representation, achieving the state of the art on 9 out of 12 metrics on the COCO and Flickr image-text retrieval benchmarks and outperforming competitive approaches that use larger models.
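The two pretraining changes described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the crop sampling rule, the interpolation method, or the exact form of the focal loss, so the function names, the `min_scale` parameter, the bilinear resize, the sigmoid-based focal formulation, and the default `gamma=2.0` are all assumptions made for illustration. The code uses only the Python standard library and represents a positional-embedding grid as nested lists of shape H x W x D.

```python
import math
import random


def bilinear_resize(grid, out_h, out_w):
    """Bilinearly resize an H x W grid of D-dim vectors to out_h x out_w."""
    in_h, in_w = len(grid), len(grid[0])
    out = []
    for i in range(out_h):
        # Map the output row to a fractional input row (corners aligned).
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(math.floor(y))
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for j in range(out_w):
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(math.floor(x))
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            corners = zip(grid[y0][x0], grid[y0][x1], grid[y1][x0], grid[y1][x1])
            row.append([
                (1 - wy) * ((1 - wx) * a + wx * b) + wy * ((1 - wx) * c + wx * d)
                for a, b, c, d in corners
            ])
        out.append(row)
    return out


def random_crop_pos_embed(pos_embed, rng, min_scale=0.1):
    """Crop a random region of the positional-embedding grid and resize it
    back to the full grid size. The sampling rule here is hypothetical."""
    h, w = len(pos_embed), len(pos_embed[0])
    crop_h = max(1, round(h * math.sqrt(rng.uniform(min_scale, 1.0))))
    crop_w = max(1, round(w * math.sqrt(rng.uniform(min_scale, 1.0))))
    top = rng.randint(0, h - crop_h)
    left = rng.randint(0, w - crop_w)
    crop = [row[left:left + crop_w] for row in pos_embed[top:top + crop_h]]
    return bilinear_resize(crop, h, w)


def sigmoid_focal_contrastive_loss(sim, gamma=2.0):
    """Focal loss over a B x B matrix of temperature-scaled image-text
    similarities; matched pairs lie on the diagonal (assumed formulation)."""
    batch = len(sim)
    total = 0.0
    for i in range(batch):
        for j in range(batch):
            p = 1.0 / (1.0 + math.exp(-sim[i][j]))
            # Probability assigned to the correct label for this pair.
            p_true = p if i == j else 1.0 - p
            p_true = max(p_true, 1e-12)  # guard against log(0)
            # The (1 - p_true)^gamma factor down-weights easy pairs.
            total -= (1.0 - p_true) ** gamma * math.log(p_true)
    return total / batch


# Usage: a toy 4 x 4 grid whose 2-dim "embedding" is just its (row, col) index.
pos_embed = [[[float(y), float(x)] for x in range(4)] for y in range(4)]
cropped = random_crop_pos_embed(pos_embed, random.Random(0))
loss = sigmoid_focal_contrastive_loss([[5.0, -5.0], [-5.0, 5.0]])
```

In this sketch the cropped grid keeps the same H x W x D shape as the original, so it can be added to the patch tokens unchanged. With `gamma=0.0` the loss reduces to a plain sum of per-pair binary cross-entropy terms, which makes the effect of the focal weighting easy to isolate.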

updated: Thu May 11 2023 17:53:29 GMT+0000 (UTC)

published: Thu May 11 2023 17:53:29 GMT+0000 (UTC)

arXiv
