EATFormer: Improving Vision Transformer Inspired by Evolutionary Algorithm

Jiangning Zhang; Xiangtai Li; Yabiao Wang; Chengjie Wang; Yibo Yang; Yong Liu; Dacheng Tao

Motivated by biological evolution, this paper explains the rationality of the Vision Transformer by analogy with the proven, practical Evolutionary Algorithm (EA) and derives that the two share a consistent mathematical formulation. Then, inspired by effective EA variants, we propose a novel pyramid EATFormer backbone that contains only the proposed EA-based Transformer (EAT) block. The block consists of three residual parts, i.e., Multi-Scale Region Aggregation (MSRA), Global and Local Interaction (GLI), and Feed-Forward Network (FFN) modules, which model multi-scale, interactive, and individual information, respectively. Moreover, we design a Task-Related Head (TRH) docked with the transformer backbone to complete the final information fusion more flexibly, and introduce an improved Modulated Deformable MSA (MD-MSA) to dynamically model irregular locations. Extensive quantitative and qualitative experiments on image classification, downstream tasks, and explanatory experiments demonstrate the effectiveness and superiority of our approach over State-Of-The-Art (SOTA) methods. E.g., our Mobile (1.8M), Tiny (6.1M), Small (24.3M), and Base (49.0M) models achieve 69.4, 78.4, 83.1, and 83.9 Top-1 accuracy when trained only on ImageNet-1K with a naive training recipe; Mask R-CNN equipped with EATFormer-Tiny/Small/Base obtains 45.4/47.4/49.0 box AP and 41.4/42.9/44.2 mask AP on COCO detection, surpassing the contemporary MPViT-T, Swin-T, and Swin-S by 0.6/1.4/0.5 box AP and 0.4/1.3/0.9 mask AP, respectively, with fewer FLOPs; our EATFormer-Small/Base achieve 47.3/49.3 mIoU on ADE20K with UperNet, exceeding Swin-T/S by 2.8/1.7. Code will be available at https://github.com/zhangzjn/EATFormer.
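The abstract describes the EAT block as three residual parts (MSRA, GLI, FFN). A minimal sketch of that composition follows. It assumes the parts are applied in the order listed and that each wraps its module as `x + module(x)`. Normalization, tensor shapes, and the modules' internals are not given in the abstract, so they are left out. The names `residual`, `eat_block`, and the `toy_*` functions are hypothetical placeholders for illustration, not the authors' implementation.

```python
from typing import Callable, List

Vector = List[float]
Module = Callable[[Vector], Vector]


def residual(module: Module, x: Vector) -> Vector:
    """Return x + module(x), element-wise."""
    out = module(x)
    if len(out) != len(x):
        raise ValueError("a residual branch must preserve the feature shape")
    return [a + b for a, b in zip(x, out)]


def eat_block(x: Vector, msra: Module, gli: Module, ffn: Module) -> Vector:
    """Chain three residual parts in the order the abstract lists them (assumed)."""
    x = residual(msra, x)  # multi-scale information
    x = residual(gli, x)   # interactive (global and local) information
    x = residual(ffn, x)   # individual information
    return x


# Toy stand-ins so the sketch runs; they are NOT the paper's MSRA/GLI/FFN.
def toy_msra(x: Vector) -> Vector:
    return [0.5 * v for v in x]


def toy_gli(x: Vector) -> Vector:
    mean = sum(x) / len(x)
    return [mean - v for v in x]


def toy_ffn(x: Vector) -> Vector:
    return [max(0.0, v) for v in x]


features = [1.0, -2.0, 3.0]
print(eat_block(features, toy_msra, toy_gli, toy_ffn))
```

Because every part is residual, each module only has to produce an update to the features rather than a full replacement, which is why the three modules can be swapped independently in this sketch.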

updated: Sun Jun 19 2022 04:49:35 GMT+0000 (UTC)

published: Sun Jun 19 2022 04:49:35 GMT+0000 (UTC)

arXiv
