BinaryViT: Pushing Binary Vision Transformers Towards Convolutional Models

Phuoc-Hoan Charles Le; Xinlin Li

As vision transformers (ViTs) grow in popularity and size, there is increasing interest in making them more efficient and less computationally costly for deployment on edge devices with limited computing resources. Binarization can significantly reduce the size of ViT models and their computational cost, since popcount operations can be used when both the weights and the activations are binary. However, when convolutional neural network (CNN) binarization methods or other existing binarization methods are applied directly to ViTs, ViTs suffer a larger performance drop than CNNs on datasets with a large number of classes, such as ImageNet-1k. Through extensive analysis, we find that binary vanilla ViTs such as DeiT lack many key architectural properties of CNNs, and that these properties give binary CNNs much higher representational capability than binary vanilla ViTs. Therefore, we propose BinaryViT, in which, inspired by the CNN architecture, we incorporate operations from CNNs into a pure ViT architecture to enrich the representational capability of a binary ViT without introducing convolutions. These include an average pooling layer instead of a token pooling layer, a block that contains multiple average pooling branches, an affine transformation right before the addition of each main residual connection, and a pyramid structure. Experimental results on the ImageNet-1k dataset show the effectiveness of these operations, which allow a binary pure ViT model to be competitive with previous state-of-the-art (SOTA) binary CNN models.
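The abstract's remark about popcount operations can be illustrated with a small sketch. When weights and activations are restricted to {-1, +1} and packed into bit masks, a dot product reduces to an XOR followed by a population count. The helper names `pack_signs` and `binary_dot` below are hypothetical and chosen only for illustration; this is a generic sketch of the arithmetic, not the authors' implementation.

```python
def pack_signs(values):
    """Pack a sequence of +1/-1 values into an int bit mask.

    Bit i is set to 1 when values[i] == +1 and left 0 when values[i] == -1.
    (Hypothetical helper, for illustration only.)
    """
    bits = 0
    for i, v in enumerate(values):
        if v not in (1, -1):
            raise ValueError("values must be +1 or -1")
        if v == 1:
            bits |= 1 << i
    return bits


def binary_dot(a_bits, b_bits, n):
    """Dot product of two length-n {-1, +1} vectors given as bit masks.

    Positions where the bits agree contribute +1 and positions where they
    differ contribute -1, so dot = n - 2 * popcount(a XOR b).
    """
    disagreements = bin(a_bits ^ b_bits).count("1")  # popcount of the XOR
    return n - 2 * disagreements


a = [1, -1, 1, 1]
b = [1, 1, -1, 1]
result = binary_dot(pack_signs(a), pack_signs(b), len(a))
reference = sum(x * y for x, y in zip(a, b))
print(result, reference)  # both are 0 for these inputs
```

On hardware with native XOR and popcount instructions, this replaces n multiply-accumulate operations with a handful of bitwise operations per machine word, which is the source of the cost reduction the abstract refers to.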

updated: Thu Jun 29 2023 04:48:02 GMT+0000 (UTC)

published: Thu Jun 29 2023 04:48:02 GMT+0000 (UTC)

arXiv
