Vis-TOP: Visual Transformer Overlay Processor

Wei Hu; Dian Xu; Zimeng Fan; Fang Liu; Yanxiang He

In recent years, the Transformer has achieved strong results in natural language processing (NLP) and has begun to expand into computer vision (CV), where models such as the Vision Transformer and Swin Transformer have emerged. At the same time, Transformer models are being deployed on embedded devices to serve resource-sensitive application scenarios. However, the large number of parameters, the complex computational flow, and the many structural variants of Transformer models raise a number of issues that hardware design must address. This is both an opportunity and a challenge. We propose Vis-TOP (Visual Transformer Overlay Processor), an overlay processor for various visual Transformer models. It differs both from coarse-grained overlay processors such as CPU, GPU, and NPE, and from fine-grained designs customized for a specific model. Vis-TOP summarizes the characteristics of all visual Transformer models and implements a three-layer and two-level transformation structure that allows the model to be switched or modified freely without changing the hardware architecture. The corresponding instruction bundle and hardware architecture are designed around this three-layer and two-level transformation structure. After quantizing the Swin Transformer tiny model to 8-bit fixed point (fix_8), we implemented the overlay processor on the ZCU102. Compared to GPU, the TOP throughput is 1.5x higher. Compared to existing Transformer accelerators, our throughput per DSP is between 2.2x and 11.7x higher. In short, the approach in this paper meets the requirements of real-time AI in terms of both resource consumption and inference speed. Vis-TOP provides a cost-effective and power-efficient solution based on reconfigurable devices for computer vision at the edge.
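The abstract does not say how the fix_8 quantization is performed. As a rough illustration only, the sketch below shows one common way to map floating-point values to signed 8-bit fixed point: scale by a power of two, round, and saturate to the int8 range. The function names, the `frac_bits` parameter, and the choice of a power-of-two scale are assumptions made for this example, not details taken from the paper.

```python
def quantize_fix8(values, frac_bits):
    """Map floats to signed 8-bit fixed-point codes (hypothetical helper).

    frac_bits is the assumed number of fractional bits, so the scale
    factor is 2**frac_bits. Results are saturated to [-128, 127].
    """
    scale = 1 << frac_bits
    codes = []
    for v in values:
        q = int(round(v * scale))      # Python rounds halves to even
        q = max(-128, min(127, q))     # saturate instead of wrapping
        codes.append(q)
    return codes


def dequantize_fix8(codes, frac_bits):
    """Recover approximate floats from 8-bit fixed-point codes."""
    scale = 1 << frac_bits
    return [q / scale for q in codes]


if __name__ == "__main__":
    weights = [0.5, -1.0, 3.0]
    codes = quantize_fix8(weights, frac_bits=6)
    print(codes)
    print(dequantize_fix8(codes, frac_bits=6))
```

With 6 fractional bits the representable range is roughly -2.0 to 1.98, so the value 3.0 saturates; choosing the number of fractional bits per tensor is the usual trade-off between range and precision.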

updated: Thu Oct 21 2021 08:11:12 GMT+0000 (UTC)

published: Thu Oct 21 2021 08:11:12 GMT+0000 (UTC)

arXiv
