ConTNet: Why not use convolution and transformer at the same time?

Haotian Yan; Zhe Li; Weijian Li; Changhu Wang; Ming Wu; Chuang Zhang

Although convolutional networks (ConvNets) have enjoyed great success in computer vision (CV), they struggle to capture the global information that is crucial to dense prediction tasks such as object detection and segmentation. In this work, we propose ConTNet (ConvolutionTransformer Network), which combines transformer and ConvNet architectures to provide large receptive fields. Unlike recently proposed transformer-based models (e.g., ViT, DeiT), which are sensitive to hyper-parameters and depend heavily on extensive data augmentation when trained from scratch on a midsize dataset (e.g., ImageNet1k), ConTNet can be optimized like a normal ConvNet (e.g., ResNet) and remains highly robust. It is also worth pointing out that, given identical strong data augmentations, ConTNet's performance improves more than ResNet's. We demonstrate its superiority and effectiveness on image classification and downstream tasks. For example, our ConTNet achieves 81.8% top-1 accuracy on ImageNet, the same as DeiT-B, with less than 40% of the computational complexity. ConTNet-M also outperforms ResNet50 as the backbone of both Faster-RCNN (by 2.6%) and Mask-RCNN (by 3.2%) on the COCO2017 dataset. We hope that ConTNet can serve as a useful backbone for CV tasks and bring new ideas for model design.
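
The receptive-field argument in the abstract can be illustrated with a toy 1-D sketch in plain Python. This is not the ConTNet architecture, whose details are not given here; it only shows, under simplifying assumptions, that a small convolution sees a few neighbouring inputs while a convolution followed by self-attention sees all of them. Every function name below (`conv1d_3tap`, `self_attention`, `hybrid_block`, `receptive_field`) is hypothetical, and the attention uses scalar features with identity projections purely for brevity.

```python
import math


def conv1d_3tap(x, w=(0.25, 0.5, 0.25)):
    """3-tap convolution with zero padding: each output sees at most 3 inputs."""
    n = len(x)
    out = []
    for i in range(n):
        left = x[i - 1] if i > 0 else 0.0
        right = x[i + 1] if i < n - 1 else 0.0
        out.append(w[0] * left + w[1] * x[i] + w[2] * right)
    return out


def self_attention(x):
    """Toy single-head attention on scalars (identity Q/K/V projections, assumed)."""
    out = []
    for q in x:
        scores = [q * k for k in x]
        m = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        out.append(sum(e / z * v for e, v in zip(exps, x)))
    return out


def hybrid_block(x):
    """Illustrative hybrid: local convolution, then global attention, plus a residual."""
    y = conv1d_3tap(x)
    a = self_attention(y)
    return [yi + ai for yi, ai in zip(y, a)]


def receptive_field(fn, n, pos, eps=1e-3):
    """Return the input indices whose perturbation changes fn's output at `pos`."""
    base = [0.1 * (i + 1) for i in range(n)]
    ref = fn(base)[pos]
    touched = []
    for j in range(n):
        bumped = list(base)
        bumped[j] += eps
        if abs(fn(bumped)[pos] - ref) > 1e-12:
            touched.append(j)
    return touched


if __name__ == "__main__":
    print("conv only :", receptive_field(conv1d_3tap, 8, 4))
    print("conv+attn :", receptive_field(hybrid_block, 8, 4))
```

In this sketch the convolution's output at position 4 depends only on inputs 3 to 5, whereas the hybrid block's output depends on every input position, which is the sense in which attention enlarges the receptive field.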

updated: Thu May 06 2021 20:37:49 GMT+0000 (UTC)

published: Tue Apr 27 2021 22:29:55 GMT+0000 (UTC)

arXiv
