Container: Context Aggregation Network

Peng Gao; Jiasen Lu; Hongsheng Li; Roozbeh Mottaghi; Aniruddha Kembhavi

Convolutional neural networks (CNNs) are ubiquitous in computer vision, with a myriad of effective and efficient variations. Recently, Transformers, originally introduced in natural language processing, have been increasingly adopted in computer vision. While early adopters continue to employ CNN backbones, the latest networks are end-to-end CNN-free Transformer solutions. A recent surprising finding shows that a simple MLP-based solution without any traditional convolutional or Transformer components can produce effective visual representations. While CNNs, Transformers and MLP-Mixers may be considered completely disparate architectures, we provide a unified view showing that they are in fact special cases of a more general method to aggregate spatial context in a neural network stack. We present Container (CONText AggregatIon NEtwoRk), a general-purpose building block for multi-head context aggregation that can exploit long-range interactions in the manner of Transformers while still exploiting the inductive bias of the local convolution operation, which leads to the faster convergence often seen in CNNs. In contrast to Transformer-based methods, which do not scale well to downstream tasks that rely on larger input image resolutions, our efficient network, named Container-Light, can be employed in object detection and instance segmentation networks. With DETR, RetinaNet and Mask-RCNN it obtains detection mAP of 38.9, 43.8 and 45.1, and with Mask-RCNN a mask mAP of 41.3, improvements of 6.6, 7.3, 6.9 and 6.6 points respectively over a ResNet-50 backbone with comparable compute and parameter size. Our method also achieves promising results on self-supervised learning compared to DeiT on the DINO framework. Code is released at https://github.com/allenai/container.
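The abstract states the unified view without spelling it out. The following is a minimal single-head NumPy sketch of one way to read it, assuming each architecture family can be written as an affinity matrix A applied to projected features with a residual connection, Y = A (X W_v) + X. All function names, the 1-D token layout, and the mixing weights `alpha` and `beta` are illustrative assumptions, not the released implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_affinity(x, w_q, w_k):
    """Transformer-style affinity: computed from the input itself."""
    q, k = x @ w_q, x @ w_k
    return softmax(q @ k.T / np.sqrt(q.shape[-1]))

def local_static_affinity(n, kernel):
    """Convolution-style affinity: fixed weights on a local window
    (1-D tokens, zero padding at the borders)."""
    a = np.zeros((n, n))
    r = len(kernel) // 2
    for i in range(n):
        for j, w in enumerate(kernel):
            p = i + j - r
            if 0 <= p < n:
                a[i, p] = w
    return a

def mixed_affinity(a_dynamic, a_static, alpha, beta):
    """Assumed mixing rule: weighted sum of dynamic and static affinities."""
    return alpha * a_dynamic + beta * a_static

def aggregate(x, affinity, w_v):
    """Context aggregation with a residual: Y = A (X W_v) + X."""
    return affinity @ (x @ w_v) + x

# Usage: 6 tokens with 4 channels each.
rng = np.random.default_rng(0)
n, d = 6, 4
x = rng.normal(size=(n, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

a_dyn = dynamic_affinity(x, w_q, w_k)                    # input-dependent, global
a_loc = local_static_affinity(n, np.array([0.25, 0.5, 0.25]))  # input-independent, local
y = aggregate(x, mixed_affinity(a_dyn, a_loc, alpha=0.5, beta=0.5), w_v)
```

In this reading, setting `beta = 0` leaves a self-attention-style layer, setting `alpha = 0` with a banded static matrix behaves like a depthwise convolution over tokens, and replacing the banded matrix with a dense learned n-by-n matrix gives an MLP-Mixer-style token-mixing layer.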

updated: Mon Oct 18 2021 06:52:31 GMT+0000 (UTC)

published: Wed Jun 02 2021 18:09:11 GMT+0000 (UTC)

arXiv

