InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions

Wenhai Wang; Jifeng Dai; Zhe Chen; Zhenhang Huang; Zhiqi Li; Xizhou Zhu; Xiaowei Hu; Tong Lu; Lewei Lu; Hongsheng Li; Xiaogang Wang; Yu Qiao

Compared to the great progress of large-scale vision transformers (ViTs) in recent years, large-scale models based on convolutional neural networks (CNNs) are still at an early stage. This work presents a new large-scale CNN-based foundation model, termed InternImage, which, like ViTs, can benefit from increasing parameters and training data. Unlike recent CNNs that focus on large dense kernels, InternImage takes deformable convolution as its core operator, so that the model not only has the large effective receptive field required for downstream tasks such as detection and segmentation, but also has adaptive spatial aggregation conditioned on input and task information. As a result, the proposed InternImage reduces the strict inductive bias of traditional CNNs and makes it possible to learn stronger and more robust patterns with large-scale parameters from massive data, as ViTs do. The effectiveness of the model is demonstrated on challenging benchmarks including ImageNet, COCO, and ADE20K. Notably, InternImage-H achieved a new record of 65.4 mAP on COCO test-dev and 62.9 mIoU on ADE20K, outperforming current leading CNNs and ViTs. The code will be released at https://github.com/OpenGVLab/InternImage.
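
To make "adaptive spatial aggregation" concrete, the following is a minimal sketch of the general idea behind a deformable convolution at a single output location, written in plain Python. It assumes a DCNv2-style modulated form (a 3x3 grid, per-point offsets, and per-point modulation scalars); the abstract does not specify the exact operator, so this is an illustration rather than InternImage's implementation. The names `bilinear_sample`, `deformable_aggregate`, and `GRID` are hypothetical, and the offsets and modulation values are passed in by hand here, whereas in a real model they would be predicted from the input features.

```python
import math

# Regular 3x3 sampling grid around the output location (assumed kernel size).
GRID = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]


def bilinear_sample(image, y, x):
    """Bilinearly interpolate a 2D list-of-lists at fractional (y, x).

    Positions outside the image contribute zero.
    """
    height, width = len(image), len(image[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yi in (y0, y0 + 1):
        for xi in (x0, x0 + 1):
            if 0 <= yi < height and 0 <= xi < width:
                weight = (1.0 - abs(y - yi)) * (1.0 - abs(x - xi))
                value += weight * image[yi][xi]
    return value


def deformable_aggregate(image, center, offsets, modulation, weights):
    """Compute sum_k w_k * m_k * image(center + grid_k + offset_k).

    offsets:    nine (dy, dx) pairs that shift each sampling point
    modulation: nine scalars that scale each sampled value
    weights:    nine kernel weights, as in an ordinary 3x3 convolution
    """
    cy, cx = center
    out = 0.0
    for (gy, gx), (oy, ox), m, w in zip(GRID, offsets, modulation, weights):
        out += w * m * bilinear_sample(image, cy + gy + oy, cx + gx + ox)
    return out


# Toy input: a 4x4 image whose value at (y, x) is 4*y + x.
image = [[4 * y + x for x in range(4)] for y in range(4)]
uniform = [1.0 / 9.0] * 9
ones = [1.0] * 9

# Zero offsets reduce to an ordinary 3x3 averaging convolution.
plain = deformable_aggregate(image, (1, 1), [(0.0, 0.0)] * 9, ones, uniform)

# Non-zero offsets move the sampling points, changing what is aggregated.
shifted = deformable_aggregate(image, (1, 1), [(0.5, 0.5)] * 9, ones, uniform)
```

With zero offsets and unit modulation the function behaves like a fixed 3x3 convolution; letting the offsets and modulation depend on the input is what allows the sampling pattern to adapt per location.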

updated: Thu Mar 02 2023 18:13:33 GMT+0000 (UTC)

published: Thu Nov 10 2022 18:59:04 GMT+0000 (UTC)

arXiv
