Focal Modulation Networks

Jianwei Yang; Chunyuan Li; Jianfeng Gao

In this work, we propose focal modulation networks (FocalNets for short), in which self-attention (SA) is completely replaced by a focal modulation module that is more effective and efficient for modeling token interactions. Focal modulation comprises three components: (i) hierarchical contextualization, implemented using a stack of depth-wise convolutional layers, to encode visual contexts from short to long ranges at different granularity levels; (ii) gated aggregation, to selectively aggregate context features for each visual token (query) based on its content; and (iii) modulation, or element-wise affine transformation, to fuse the aggregated features into the query vector. Extensive experiments show that FocalNets outperform state-of-the-art SA counterparts (e.g., Swin Transformers) at similar time and memory cost on image classification, object detection, and semantic segmentation. Specifically, our FocalNets with tiny and base sizes achieve 82.3% and 83.9% top-1 accuracy on ImageNet-1K. After pretraining on ImageNet-22K, FocalNet attains 86.5% and 87.3% top-1 accuracy when finetuned at resolutions 224×224 and 384×384, respectively. FocalNets exhibit remarkable superiority when transferred to downstream tasks. For object detection with Mask R-CNN, our FocalNet base trained with a 1× schedule already surpasses Swin trained with a 3× schedule (49.0 vs. 48.5). For semantic segmentation with UperNet, FocalNet base evaluated at single scale outperforms Swin evaluated at multi-scale (50.5 vs. 49.7). These results render focal modulation a favorable alternative to SA for effective and efficient visual modeling in real-world applications. Code is available at https://github.com/microsoft/FocalNet.
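The three components listed in the abstract can be illustrated with a small sketch. This is not the authors' implementation (that lives in the repository linked above); it is a toy, pure-Python version on a 1D token sequence, written under several assumptions the abstract does not state: the query, context, gate, and output mappings are plain bias-free linear projections (the hypothetical weights `w_q`, `w_z`, `w_g`, `w_h`), GELU follows each depth-wise convolution, kernels have odd length, and one extra "global" level is formed by average pooling.

```python
import math


def gelu(v):
    # Exact GELU; using it after each convolution is an assumption of this sketch.
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def linear(seq, weight):
    # seq: [tokens][in_channels]; weight: [out_channels][in_channels]; no bias.
    return [[sum(w * x for w, x in zip(row, tok)) for row in weight] for tok in seq]


def depthwise_conv1d(seq, kernel):
    # Each channel is filtered independently with zero padding.
    # kernel: [channels][k], with k assumed odd.
    n, c = len(seq), len(seq[0])
    half = len(kernel[0]) // 2
    out = [[0.0] * c for _ in range(n)]
    for i in range(n):
        for ch in range(c):
            for j, w in enumerate(kernel[ch]):
                pos = i + j - half
                if 0 <= pos < n:
                    out[i][ch] += w * seq[pos][ch]
    return out


def focal_modulation(x, w_q, w_z, w_g, w_h, kernels):
    # x: [tokens][channels]; w_g yields len(kernels) + 1 gate values per token.
    n, c = len(x), len(x[0])
    q = linear(x, w_q)        # query vector for each token
    z = linear(x, w_z)        # initial context features
    gates = linear(x, w_g)    # content-dependent gates, one per level
    agg = [[0.0] * c for _ in range(n)]

    # (i) hierarchical contextualization and (ii) gated aggregation:
    # each level widens the receptive field, and each token decides
    # how much of that level's context to keep.
    for level, kernel in enumerate(kernels):
        z = [[gelu(v) for v in tok] for tok in depthwise_conv1d(z, kernel)]
        for i in range(n):
            for ch in range(c):
                agg[i][ch] += gates[i][level] * z[i][ch]

    # Assumed extra level: global context from average pooling over all tokens.
    global_ctx = [sum(z[i][ch] for i in range(n)) / n for ch in range(c)]
    for i in range(n):
        for ch in range(c):
            agg[i][ch] += gates[i][len(kernels)] * global_ctx[ch]

    # (iii) modulation: element-wise product of query and projected context.
    modulator = linear(agg, w_h)
    return [[qv * mv for qv, mv in zip(qt, mt)] for qt, mt in zip(q, modulator)]


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    tokens = [[1.0, 0.5], [0.2, -0.3], [0.7, 0.1]]
    # Two focal levels (kernel sizes 3 and 5) plus the global level: three gates.
    gate_weights = [[0.5, 0.0], [0.0, 0.5], [0.1, 0.1]]
    kernels = [[[0.25, 0.5, 0.25]] * 2, [[0.2] * 5] * 2]
    for row in focal_modulation(tokens, identity, identity, gate_weights, identity, kernels):
        print(row)
```

Unlike self-attention, nothing here compares one token against every other token: context is gathered once by the convolutions and pooling, and each token then only gates and multiplies it, so the cost of this sketch grows linearly with the number of tokens.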

updated: Tue Mar 22 2022 17:54:50 GMT+0000 (UTC)

published: Tue Mar 22 2022 17:54:50 GMT+0000 (UTC)

arXiv
