EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision Transformers

Junting Pan; Adrian Bulat; Fuwen Tan; Xiatian Zhu; Lukasz Dudziak; Hongsheng Li; Georgios Tzimiropoulos; Brais Martinez

Self-attention based models such as vision transformers (ViTs) have emerged as a very competitive architectural alternative to convolutional neural networks (CNNs) in computer vision. Although increasingly strong variants reach ever-higher recognition accuracy, existing ViTs are typically demanding in computation and model size because self-attention has quadratic complexity in the number of tokens. Several successful design choices of prior CNNs (e.g., convolutions and a hierarchical multi-stage structure) have been reintroduced into recent ViTs, but they are still not sufficient to meet the limited resource budgets of mobile devices. This motivated a very recent attempt to develop light ViTs based on the state-of-the-art MobileNet-v2, which still leaves a performance gap. In this work, pushing further along this under-studied direction, we introduce EdgeViTs, a new family of light-weight ViTs that, for the first time, enable attention-based vision models to compete with the best light-weight CNNs in the trade-off between accuracy and on-device efficiency. This is realized by introducing a highly cost-effective local-global-local (LGL) information exchange bottleneck based on optimal integration of self-attention and convolutions. For device-dedicated evaluation, rather than relying on inaccurate proxies like the number of FLOPs or parameters, we adopt a practical approach of focusing directly on on-device latency and, for the first time, energy efficiency. Specifically, we show that our models are Pareto-optimal when both accuracy-latency and accuracy-energy trade-offs are considered, achieving strict dominance over other ViTs in almost all cases and competing with the most efficient CNNs.
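The Pareto-optimality claim in the abstract can be made concrete with a small sketch. The code below is illustrative only: the `Model` type, the `dominates` and `pareto_front` helpers, and all numbers are hypothetical and are not taken from the paper. It assumes the common definition of dominance (no worse on both axes, strictly better on at least one), which may differ from the exact criterion the authors use.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str          # hypothetical label, not a real model
    accuracy: float    # top-1 accuracy in percent, higher is better
    latency_ms: float  # on-device latency in milliseconds, lower is better


def dominates(a: Model, b: Model) -> bool:
    """True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a.accuracy >= b.accuracy and a.latency_ms <= b.latency_ms
    strictly_better = a.accuracy > b.accuracy or a.latency_ms < b.latency_ms
    return no_worse and strictly_better


def pareto_front(models: list[Model]) -> list[Model]:
    """Keep every model that no other model dominates."""
    return [m for m in models if not any(dominates(o, m) for o in models)]


# Made-up numbers for illustration only; these are not results from the paper.
candidates = [
    Model("A", 74.0, 30.0),
    Model("B", 72.0, 35.0),  # less accurate and slower than A, so dominated
    Model("C", 78.0, 60.0),
    Model("D", 70.0, 20.0),
]
front_names = [m.name for m in pareto_front(candidates)]
```

With these made-up candidates, `B` is dropped because `A` is both more accurate and faster, while `A`, `C`, and `D` each represent a different accuracy-latency trade-off and remain on the front. The same check applies to the accuracy-energy trade-off by swapping the latency field for an energy measurement.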

updated: Fri May 06 2022 18:17:19 GMT+0000 (UTC)

published: Fri May 06 2022 18:17:19 GMT+0000 (UTC)

arXiv
