Container: Context Aggregation Network

Peng Gao; Jiasen Lu; Hongsheng Li; Roozbeh Mottaghi; Aniruddha Kembhavi

Convolutional neural networks (CNNs) are ubiquitous in computer vision, with a myriad of effective and efficient variations. Recently, Transformers, originally introduced in natural language processing, have been increasingly adopted in computer vision. While early adopters continue to employ CNN backbones, the latest networks are end-to-end CNN-free Transformer solutions. A recent surprising finding shows that a simple MLP-based solution without any traditional convolutional or Transformer components can produce effective visual representations. While CNNs, Transformers and MLP-Mixers may be considered completely disparate architectures, we provide a unified view showing that they are in fact special cases of a more general method to aggregate spatial context in a neural network stack. We present the Container (CONText AggregatIon NEtwoRk), a general-purpose building block for multi-head context aggregation that can exploit long-range interactions à la Transformers while still exploiting the inductive bias of the local convolution operation, leading to the faster convergence speeds often seen in CNNs. In contrast to Transformer-based methods that do not scale well to downstream tasks that rely on larger input image resolutions, our efficient network, named Container-Light, can be employed in object detection and instance segmentation networks such as DETR, RetinaNet and Mask-RCNN to obtain an impressive detection mAP of 38.9, 43.8 and 45.1 and a mask mAP of 41.3, providing large improvements of 6.6, 7.3, 6.9 and 6.6 points respectively, compared to a ResNet-50 backbone with comparable compute and parameter size. Our method also achieves promising results on self-supervised learning compared to DeiT on the DINO framework.
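One way to read the unified view in the abstract is that each architecture computes an output as an affinity matrix applied to a set of values, and the architectures differ in how that matrix is built. The sketch below is a minimal illustration under that assumption, not the authors' implementation: the function names, the identity query/key projections, the 1D token layout, and the mixing weights `alpha` and `beta` are all hypothetical choices made for the example.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(m):
    """Apply a numerically stable softmax to each row."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attention_affinity(x):
    """Dynamic affinity: computed from the input itself (self-attention style).

    Assumption: queries and keys are the raw tokens (identity projections).
    """
    d = len(x[0])
    x_t = [list(col) for col in zip(*x)]
    scores = matmul(x, x_t)
    scaled = [[v / math.sqrt(d) for v in row] for row in scores]
    return softmax_rows(scaled)


def conv_affinity(n, kernel):
    """Static affinity: a banded matrix equivalent to a zero-padded 1D convolution.

    The weights do not depend on the input, only on relative position.
    """
    r = len(kernel) // 2
    a = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k, w in enumerate(kernel):
            j = i + k - r
            if 0 <= j < n:
                a[i][j] = w
    return a


def mixed_affinity(dynamic, static, alpha, beta):
    """Hypothetical blend of a dynamic and a static affinity matrix."""
    return [[alpha * d + beta * s for d, s in zip(dr, sr)]
            for dr, sr in zip(dynamic, static)]


def aggregate(affinity, values):
    """Context aggregation: each output token is a weighted sum of value tokens."""
    return matmul(affinity, values)


# Three tokens with two channels each.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
dynamic = attention_affinity(x)
static = conv_affinity(len(x), [0.25, 0.5, 0.25])
y = aggregate(mixed_affinity(dynamic, static, 0.5, 0.5), x)
```

In this reading, setting `alpha=1, beta=0` leaves only the input-dependent affinity, while `alpha=0, beta=1` leaves only the fixed local one; a dense, input-independent matrix in place of `static` would correspond to an MLP-Mixer-style token mixing layer.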

updated: Wed Jun 02 2021 18:09:11 GMT+0000 (UTC)

published: Wed Jun 02 2021 18:09:11 GMT+0000 (UTC)

arXiv
