SepViT: Separable Vision Transformer

Wei Li; Xing Wang; Xin Xia; Jie Wu; Xuefeng Xiao; Min Zheng; Shiping Wen

Vision Transformers have achieved widespread success across a range of vision tasks. However, they often require an enormous amount of computation to achieve high performance, which makes them burdensome to deploy on resource-constrained devices. To address these issues, we draw lessons from depthwise separable convolution and follow its design philosophy to build the Separable Vision Transformer, abbreviated as SepViT. SepViT carries out information interaction within and among windows via a depthwise separable self-attention. The novel window token embedding is employed to model the attention relationship among windows with negligible computational cost, and grouped self-attention is employed to capture long-range visual dependencies across multiple windows. Extensive experiments on various benchmark tasks demonstrate that SepViT can achieve state-of-the-art results in terms of the trade-off between accuracy and latency. In particular, SepViT achieves 84.0% top-1 accuracy on ImageNet-1K classification while decreasing latency by 40% compared to models with similar accuracy (e.g., CSWin, PVTV2). On downstream vision tasks, SepViT with fewer FLOPs achieves 50.4% mIoU on the ADE20K semantic segmentation task, 47.5 AP on the RetinaNet-based COCO detection task, and 48.7 box AP and 43.9 mask AP on the Mask R-CNN-based COCO detection and segmentation tasks.
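
The two-step idea described in the abstract (attention within each window, then attention among windows via a per-window token) can be illustrated with a small sketch. This is only one plausible reading of the abstract, not the authors' implementation: the function names `attention` and `separable_attention` are hypothetical, the query/key/value projections are assumed to be identity maps, a single head is used, all windows are assumed to hold the same number of pixel tokens, and grouped self-attention is not covered.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention(queries, keys, values):
    """Plain scaled dot-product attention over lists of vectors."""
    scale = math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        out.append([sum(w * v[c] for w, v in zip(weights, values))
                    for c in range(len(values[0]))])
    return out


def separable_attention(windows, window_token):
    """Hypothetical two-step attention; projections are identity (assumption).

    windows: list of windows, each a list of pixel-token vectors.
    window_token: initial vector prepended to every window.
    """
    # Step 1: attention inside each window, with the window token prepended
    # so that it can summarize the window it belongs to.
    tokens, feats = [], []
    for win in windows:
        seq = [window_token] + win
        mixed = attention(seq, seq, seq)
        tokens.append(mixed[0])   # updated window token
        feats.append(mixed[1:])   # updated pixel tokens

    # Step 2: attention among window tokens only; the resulting weights
    # mix whole windows, so cost grows with the number of windows rather
    # than the number of pixels.
    scale = math.sqrt(len(window_token))
    n_pix, dim = len(feats[0]), len(window_token)
    out = []
    for t in tokens:
        weights = softmax([dot(t, u) / scale for u in tokens])
        out.append([[sum(w * f[p][c] for w, f in zip(weights, feats))
                     for c in range(dim)] for p in range(n_pix)])
    return out


windows = [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 2.0], [0.5, 0.5]]]
result = separable_attention(windows, [0.0, 0.0])
```

In this sketch the output has the same layout as the input (windows, then pixel tokens, then channels), and with a single window the second step leaves the first step's result unchanged, since there is nothing else to mix with.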

updated: Sat May 07 2022 08:20:10 GMT+0000 (UTC)

published: Tue Mar 29 2022 09:20:01 GMT+0000 (UTC)

arXiv
