CBNet: A Composite Backbone Network Architecture for Object Detection

Tingting Liang; Xiaojie Chu; Yudong Liu; Yongtao Wang; Zhi Tang; Wei Chu; Jingdong Chen; Haibin Ling

Modern top-performing object detectors depend heavily on backbone networks, whose advances bring consistent performance gains through the exploration of more effective network structures. In this paper, we propose a novel and flexible backbone framework, CBNetV2, for constructing high-performance detectors from existing open-source pre-trained backbones under the pre-training and fine-tuning paradigm. The CBNetV2 architecture groups multiple identical backbones and connects them through composite connections. Specifically, it integrates the high- and low-level features of multiple backbone networks and gradually expands the receptive field to perform object detection more efficiently. We also propose a better training strategy with assistant supervision for CBNet-based detectors. Without additional pre-training of the composite backbone, CBNetV2 can be adapted to various backbones (CNN-based and Transformer-based) and to the head designs of most mainstream detectors (one-stage and two-stage, anchor-based and anchor-free). Experiments provide strong evidence that, compared with simply increasing the depth and width of the network, CBNetV2 offers a more efficient, effective, and resource-friendly way to build high-performance backbone networks. Notably, our Dual-Swin-L achieves 59.4% box AP and 51.6% mask AP on COCO test-dev under the single-model, single-scale testing protocol, significantly better than the previous state-of-the-art result (57.7% box AP and 50.2% mask AP) achieved by Swin-L, while the training schedule is reduced by 6×. With multi-scale testing, we push the current best single-model result to a new record of 60.1% box AP and 52.3% mask AP without using extra training data. Code is available at https://github.com/VDIGPKU/CBNetV2.
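The idea of grouping identical backbones and linking them through composite connections can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the form of the composite connection, so the sketch assumes the simplest possible one (element-wise addition of the assisting backbone's feature to the lead backbone's feature at the same stage). The names `make_backbone` and `composite_forward` are hypothetical, and each "stage" is a stand-in that merely scales a feature vector.

```python
from typing import Callable, List

Feature = List[float]
Stage = Callable[[Feature], Feature]


def make_backbone(scales: List[float]) -> List[Stage]:
    """Hypothetical stand-in for a backbone: each stage scales its input."""
    return [lambda x, s=s: [s * v for v in x] for s in scales]


def composite_forward(assist: List[Stage], lead: List[Stage], x: Feature) -> List[Feature]:
    """Run two backbones on the same input and fuse them stage by stage.

    Assumption: the composite connection is an element-wise sum of the
    assisting backbone's stage output and the lead backbone's stage output.
    """
    # First pass: the assisting backbone, keeping every stage's output.
    assist_feats = []
    h = x
    for stage in assist:
        h = stage(h)
        assist_feats.append(h)

    # Second pass: the lead backbone, with the assisting features added
    # after each stage so later stages see the fused representation.
    lead_feats = []
    h = x
    for stage, a in zip(lead, assist_feats):
        h = stage(h)
        h = [u + v for u, v in zip(h, a)]  # assumed composite connection
        lead_feats.append(h)
    return lead_feats


# Two identical two-stage backbones, as in a "dual" composite backbone.
backbone = make_backbone([2.0, 3.0])
features = composite_forward(backbone, backbone, [1.0, 2.0])
print(features)  # [[4.0, 8.0], [18.0, 36.0]]
```

In a real detector the per-stage features would differ in resolution and channel count, so the fusion step would need a projection and resizing rather than a plain sum; the released code at the URL above is the reference for the actual design.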

updated: Tue Oct 18 2022 05:09:09 GMT+0000 (UTC)

published: Thu Jul 01 2021 13:05:11 GMT+0000 (UTC)

arXiv

