SepViT: Separable Vision Transformer

Wei Li; Xing Wang; Xin Xia; Jie Wu; Jiashi Li; Xuefeng Xiao; Min Zheng; Shiping Wen

Vision Transformers have achieved widespread success across a range of vision tasks. However, they often rely on heavy computation to reach high performance, which makes them burdensome to deploy on resource-constrained devices. To alleviate this issue, we draw on depthwise separable convolution and follow its design principle to build an efficient Transformer backbone, the Separable Vision Transformer (SepViT). SepViT carries out local-global information interaction within and among windows sequentially via a depthwise separable self-attention. A novel window token embedding computes the attention relationship among windows at negligible cost, and grouped self-attention establishes long-range visual interactions across multiple windows. Extensive experiments on general-purpose vision benchmarks demonstrate that SepViT achieves a state-of-the-art trade-off between performance and latency. In particular, SepViT achieves 84.2% top-1 accuracy on ImageNet-1K classification while reducing latency by 40% compared to models with similar accuracy (e.g., CSWin). Furthermore, SepViT achieves 51.0% mIoU on the ADE20K semantic segmentation task, 47.9 AP on the RetinaNet-based COCO detection task, and 49.4 box AP and 44.6 mask AP on the Mask R-CNN-based COCO object detection and instance segmentation tasks.
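As a rough illustration of the two-step idea in the abstract, the sketch below runs attention inside each window first and then mixes whole windows using attention among per-window summary tokens. It is a simplified toy under stated assumptions, not the authors' implementation: tokens are a 1D list per window, there are no learned projections, heads, normalization, or grouping, and the names `attend`, `separable_attention`, and `window_token` are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(queries, keys, values):
    """Scaled dot-product attention over plain Python lists of vectors."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        weights = softmax([dot(q, k) * scale for k in keys])
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out

def separable_attention(windows, window_token):
    """Toy two-step attention (assumption: identity Q/K/V projections).

    windows: list of windows, each a list of token vectors of equal size.
    window_token: one vector prepended to every window as its summary slot.
    """
    # Step 1 (within windows): each window attends only to itself,
    # with the window token prepended so it can gather a summary.
    mixed_windows, summaries = [], []
    for win in windows:
        seq = [window_token] + win
        mixed = attend(seq, seq, seq)
        summaries.append(mixed[0])      # summary of this window
        mixed_windows.append(mixed[1:])  # locally mixed tokens

    # Step 2 (among windows): attention weights come from the summaries,
    # and are used to mix the windows' flattened token maps.
    n_tok, dim = len(mixed_windows[0]), len(window_token)
    flat = [[x for tok in win for x in tok] for win in mixed_windows]
    mixed_flat = attend(summaries, summaries, flat)
    return [[row[i * dim:(i + 1) * dim] for i in range(n_tok)]
            for row in mixed_flat]

windows = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [1.0, 1.0]]]
out = separable_attention(windows, window_token=[0.0, 0.0])
print(len(out), len(out[0]), len(out[0][0]))
```

The second step only computes an attention map of size (number of windows) squared, which is why exchanging information among windows can be cheap relative to full global attention over all tokens.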

updated: Thu Jun 15 2023 16:37:26 GMT+0000 (UTC)

published: Tue Mar 29 2022 09:20:01 GMT+0000 (UTC)

arXiv
