gSwin: Gated MLP Vision Model with Hierarchical Structure of Shifted Window

Mocho Go; Hideyuki Tachibana

Following its success in the language domain, the self-attention mechanism (Transformer) has been adopted in the vision domain and has recently achieved great success there. As a separate stream, multi-layer perceptrons (MLPs) have also been explored in the vision domain. These architectures, as alternatives to traditional CNNs, have been attracting attention recently, and many methods have been proposed. To combine parameter efficiency and performance with the locality and hierarchy of image recognition, we propose gSwin, which merges the two streams: Swin Transformer and (multi-head) gMLP. We show that gSwin can achieve better accuracy than Swin Transformer on three vision tasks (image classification, object detection, and semantic segmentation) with a smaller model size.

updated: Sat Sep 02 2023 08:14:57 GMT+0000 (UTC)

published: Wed Aug 24 2022 18:00:46 GMT+0000 (UTC)

arXiv
