Scale-Aware Modulation Meet Transformer

Weifeng Lin; Ziheng Wu; Jiayu Chen; Jun Huang; Lianwen Jin

This paper presents a new vision Transformer, the Scale-Aware Modulation Transformer (SMT), which handles various downstream tasks efficiently by combining the convolutional network and the vision Transformer. The proposed Scale-Aware Modulation (SAM) in SMT includes two primary novel designs. First, we introduce the Multi-Head Mixed Convolution (MHMC) module, which captures multi-scale features and expands the receptive field. Second, we propose the Scale-Aware Aggregation (SAA) module, which is lightweight but effective and enables information fusion across different heads. Together, these two modules further enhance convolutional modulation. Furthermore, in contrast to prior works that use modulation throughout all stages to build an attention-free network, we propose an Evolutionary Hybrid Network (EHN), which effectively simulates the shift from capturing local dependencies to capturing global dependencies as the network becomes deeper, resulting in superior performance. Extensive experiments demonstrate that SMT significantly outperforms existing state-of-the-art models across a wide range of visual tasks. Specifically, SMT with 11.5M parameters / 2.4 GFLOPs and 32M parameters / 7.7 GFLOPs achieves 82.2% and 84.3% top-1 accuracy on ImageNet-1K, respectively. After being pretrained on ImageNet-22K at 224^2 resolution, it attains 87.1% and 88.1% top-1 accuracy when finetuned at resolutions of 224^2 and 384^2, respectively. For object detection with Mask R-CNN, the SMT base trained with the 1x and 3x schedules outperforms its Swin Transformer counterpart by 4.2 and 1.3 mAP on COCO, respectively. For semantic segmentation with UPerNet, the SMT base tested at single and multi-scale surpasses Swin by 2.0 and 1.1 mIoU, respectively, on ADE20K.
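
The abstract describes MHMC and SAA only at a high level, so the following NumPy sketch is an illustration under stated assumptions rather than the authors' implementation. It splits the channels into heads, applies a depthwise convolution with a different kernel size to each head (the multi-scale step), mixes the heads with a pointwise projection (a stand-in for the cross-head aggregation), and uses the result to modulate a projected copy of the input by element-wise multiplication. The kernel sizes, the weight shapes, and the names `depthwise_conv2d`, `mixed_conv_modulation`, `w_agg`, and `w_value` are all hypothetical.

```python
import numpy as np


def depthwise_conv2d(x, kernels):
    """Same-padded depthwise cross-correlation.

    x: (C, H, W) float array; kernels: (C, k, k) with odd k, one kernel per channel.
    """
    _, h, w = x.shape
    k = kernels.shape[-1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros_like(x)
    # Accumulate one shifted, per-channel-weighted copy of the input per kernel tap.
    for i in range(k):
        for j in range(k):
            out += kernels[:, i, j][:, None, None] * xp[:, i:i + h, j:j + w]
    return out


def mixed_conv_modulation(x, head_kernels, w_agg, w_value):
    """Illustrative multi-scale modulation (assumed structure, not the paper's code).

    x: (C, H, W); head_kernels: one (C // heads, k, k) array per head, with a
    different k per head; w_agg, w_value: (C, C) pointwise (1x1) projections.
    """
    # Multi-scale step: each group of channels sees a different kernel size.
    groups = np.split(x, len(head_kernels), axis=0)
    multi_scale = np.concatenate(
        [depthwise_conv2d(g, k) for g, k in zip(groups, head_kernels)], axis=0
    )
    # Aggregation step: a pointwise projection lets information flow across heads.
    modulator = np.einsum("oc,chw->ohw", w_agg, multi_scale)
    # Modulation step: element-wise product with a projected copy of the input.
    value = np.einsum("oc,chw->ohw", w_value, x)
    return modulator * value


rng = np.random.default_rng(0)
C, H, W = 8, 6, 6
kernel_sizes = [3, 5, 7, 9]  # assumed; the abstract does not give kernel sizes
x = rng.standard_normal((C, H, W))
head_kernels = [
    0.1 * rng.standard_normal((C // len(kernel_sizes), k, k)) for k in kernel_sizes
]
w_agg = 0.1 * rng.standard_normal((C, C))
w_value = 0.1 * rng.standard_normal((C, C))
y = mixed_conv_modulation(x, head_kernels, w_agg, w_value)
print(y.shape)  # (8, 6, 6)
```

Because every head is same-padded, the output keeps the input's spatial size whatever the kernel size, which is what allows heads with different receptive fields to be concatenated and mixed channel-wise.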

updated: Mon Jul 17 2023 15:47:48 GMT+0000 (UTC)

published: Mon Jul 17 2023 15:47:48 GMT+0000 (UTC)

arXiv
