CBNetV2: A Composite Backbone Network Architecture for Object Detection

Tingting Liang; Xiaojie Chu; Yudong Liu; Yongtao Wang; Zhi Tang; Wei Chu; Jingdong Chen; Haibin Ling

Modern top-performing object detectors depend heavily on backbone networks, whose advances bring consistent performance gains through exploring more effective network structures. In this paper, we propose a novel and flexible backbone framework, namely CBNetV2, to construct high-performance detectors using existing open-source pre-trained backbones under the pre-training fine-tuning paradigm. In particular, the CBNetV2 architecture groups multiple identical backbones, which are connected through composite connections. Specifically, it integrates the high- and low-level features of multiple backbone networks and gradually expands the receptive field to perform object detection more efficiently. We also propose a better training strategy with assistant supervision for CBNet-based detectors. Without additional pre-training of the composite backbone, CBNetV2 can be adapted to various backbones (CNN-based vs. Transformer-based) and head designs of most mainstream detectors (one-stage vs. two-stage, anchor-based vs. anchor-free). Experiments provide strong evidence that, compared with simply increasing the depth and width of the network, CBNetV2 introduces a more efficient, effective, and resource-friendly way to build high-performance backbone networks. In particular, our Dual-Swin-L achieves 59.4% box AP and 51.6% mask AP on COCO test-dev under the single-model and single-scale testing protocol, which is significantly better than the state-of-the-art result (57.7% box AP and 50.2% mask AP) achieved by Swin-L, while the training schedule is reduced by 6×. With multi-scale testing, we push the current best single-model result to a new record of 60.1% box AP and 52.3% mask AP without using extra training data. Code is available at https://github.com/VDIGPKU/CBNetV2.
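The idea of grouping identical backbones through composite connections can be illustrated with a toy sketch. It is not the authors' implementation: features are plain floats, each stage is a hypothetical affine function, and the composition rule (adding the previous backbone's stage-i output to the input of stage i in the next backbone) is an assumption chosen for simplicity. The names `make_toy_backbone` and `composite_forward` are invented for this example; the actual composite connections in CBNetV2 may differ.

```python
from typing import Callable, List, Sequence

Stage = Callable[[float], float]


def make_toy_backbone(weights: Sequence[float]) -> List[Stage]:
    """Build a toy backbone: one affine stage (w * x + 1) per weight.

    Hypothetical stand-in for real stages such as ResNet or Swin blocks.
    """
    return [lambda x, w=w: w * x + 1.0 for w in weights]


def composite_forward(backbones: Sequence[List[Stage]], x: float) -> List[float]:
    """Run the backbones in sequence and return the last one's stage features.

    Assumed composition rule: before stage i of backbone k runs, the
    stage-i output of backbone k-1 is added to its input.
    """
    if not backbones:
        raise ValueError("at least one backbone is required")

    prev_feats = None  # stage outputs of the previous (assisting) backbone
    for stages in backbones:
        feats = []
        h = x  # every backbone starts from the same input
        for i, stage in enumerate(stages):
            if prev_feats is not None:
                h = h + prev_feats[i]  # composite connection (assumed form)
            h = stage(h)
            feats.append(h)
        prev_feats = feats
    return prev_feats


# Two identical backbones, mirroring a "Dual" configuration.
backbone = make_toy_backbone([2.0, 3.0])
single_feats = composite_forward([backbone], 1.0)
dual_feats = composite_forward([backbone, backbone], 1.0)
```

In this sketch the same stage list is reused for both backbones, which matches the "identical backbones" wording; in a real detector each backbone would hold its own copy of pre-trained weights, and the features of the last backbone would be passed to the detection head.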

updated: Sat Jul 24 2021 16:50:16 GMT+0000 (UTC)

published: Thu Jul 01 2021 13:05:11 GMT+0000 (UTC)

arXiv
