Focal Modulation Networks

Jianwei Yang; Chunyuan Li; Xiyang Dai; Lu Yuan; Jianfeng Gao

We propose focal modulation networks (FocalNets for short), in which self-attention (SA) is completely replaced by a focal modulation mechanism for modeling token interactions in vision. Focal modulation comprises three components: (i) hierarchical contextualization, implemented using a stack of depth-wise convolutional layers, to encode visual contexts from short to long ranges, (ii) gated aggregation to selectively gather contexts for each query token based on its content, and (iii) element-wise modulation or affine transformation to inject the aggregated context into the query. Extensive experiments show that FocalNets outperform state-of-the-art SA counterparts (e.g., Swin and Focal Transformers) at similar computational cost on image classification, object detection, and segmentation. Specifically, FocalNets of tiny and base size achieve 82.3% and 83.9% top-1 accuracy on ImageNet-1K, respectively. After pretraining on ImageNet-22K at 224×224 resolution, FocalNet attains 86.5% and 87.3% top-1 accuracy when finetuned at resolutions 224×224 and 384×384, respectively. When transferred to downstream tasks, FocalNets exhibit clear superiority. For object detection with Mask R-CNN, FocalNet base trained with the 1× schedule outperforms its Swin counterpart by 2.1 points and already surpasses Swin trained with the 3× schedule (49.0 vs. 48.5). For semantic segmentation with UPerNet, FocalNet base at single scale outperforms Swin by 2.4 points and also beats Swin evaluated at multi-scale (50.5 vs. 49.7). Using a large FocalNet and Mask2Former, we achieve 58.5 mIoU on ADE20K semantic segmentation and 57.9 PQ on COCO panoptic segmentation. Using a huge FocalNet and DINO, we achieve 64.2 and 64.3 mAP on COCO minival and test-dev, respectively, establishing a new state of the art (SoTA) over much larger attention-based models such as SwinV2-G and BEiT-3. Code is available at https://github.com/microsoft/FocalNet.
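The three components named in the abstract can be illustrated with a small sketch. The version below is a toy, not the authors' implementation: it works on a 1D token sequence instead of a 2D feature map, uses only the Python standard library, omits biases and normalization, and adds a global average-pooled context as a final level. The function names (`focal_modulation`, `init_params`), the parameter keys (`w_q`, `w_z`, `w_g`, `w_h`, `dw_kernels`), the GELU nonlinearity, and the kernel sizes 3, 5, 7, ... are all assumptions made for illustration.

```python
import math
import random


def gelu(v):
    # Exact GELU via the error function.
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def linear(x, w):
    # x: N tokens, each a list of C_in floats; w: C_out rows of C_in weights.
    return [[sum(wi * xi for wi, xi in zip(row, tok)) for row in w] for tok in x]


def depthwise_conv1d(z, kernels):
    # One kernel per channel (depth-wise), odd kernel length, zero padding.
    n, c = len(z), len(z[0])
    out = [[0.0] * c for _ in range(n)]
    for ch in range(c):
        half = len(kernels[ch]) // 2
        for i in range(n):
            acc = 0.0
            for j, w in enumerate(kernels[ch]):
                src = i + j - half
                if 0 <= src < n:
                    acc += w * z[src][ch]
            out[i][ch] = acc
    return out


def init_params(channels, levels, seed=0):
    # Hypothetical random initialisation, for demonstration only.
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    return {
        "w_q": mat(channels, channels),      # query projection
        "w_z": mat(channels, channels),      # context projection
        "w_g": mat(levels + 1, channels),    # one gate per level, plus global
        "w_h": mat(channels, channels),      # maps context to the modulator
        # Kernel size grows with the level: 3, 5, 7, ... (assumed schedule).
        "dw_kernels": [mat(channels, 2 * l + 3) for l in range(levels)],
    }


def focal_modulation(x, params):
    n, c = len(x), len(x[0])
    q = linear(x, params["w_q"])
    z = linear(x, params["w_z"])
    gates = linear(x, params["w_g"])  # gates depend on each token's content
    ctx = [[0.0] * c for _ in range(n)]

    # (i) hierarchical contextualization: stacked depth-wise convolutions,
    # so each level sees a wider neighbourhood than the previous one.
    for level, kernels in enumerate(params["dw_kernels"]):
        z = [[gelu(v) for v in tok] for tok in depthwise_conv1d(z, kernels)]
        # (ii) gated aggregation: each token weights this level by its own gate.
        for i in range(n):
            for ch in range(c):
                ctx[i][ch] += gates[i][level] * z[i][ch]

    # Global level: average of the last feature map over all tokens.
    last = len(params["dw_kernels"])
    pooled = [gelu(sum(tok[ch] for tok in z) / n) for ch in range(c)]
    for i in range(n):
        for ch in range(c):
            ctx[i][ch] += gates[i][last] * pooled[ch]

    # (iii) element-wise modulation: the query is multiplied by the modulator.
    modulator = linear(ctx, params["w_h"])
    return [[qv * mv for qv, mv in zip(qt, mt)] for qt, mt in zip(q, modulator)]


params = init_params(channels=4, levels=2)
tokens = [[math.sin(i + ch) for ch in range(4)] for i in range(8)]
out = focal_modulation(tokens, params)
print(len(out), len(out[0]))  # 8 4
```

In this sketch the context is computed once per position and shared by all queries through the gates, so no pairwise token-to-token score matrix is formed; this is the sense in which the mechanism stands in for self-attention here.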

updated: Tue Nov 01 2022 09:41:35 GMT+0000 (UTC)

published: Tue Mar 22 2022 17:54:50 GMT+0000 (UTC)

arXiv
