BiViT: Extremely Compressed Binary Vision Transformer

Yefei He; Zhenyu Lou; Luoming Zhang; Weijia Wu; Bohan Zhuang; Hong Zhou

Model binarization can significantly compress model size, reduce energy consumption, and accelerate inference through efficient bit-wise operations. Although the binarization of convolutional neural networks has been extensively studied, there is little work exploring binarization of vision Transformers, which underpin most recent breakthroughs in visual recognition. To this end, we propose to solve two fundamental challenges to push the horizon of Binary Vision Transformers (BiViT). First, traditional binarization methods do not take the long-tailed distribution of softmax attention into consideration, which causes large binarization errors in the attention module. To solve this, we propose Softmax-aware Binarization, which dynamically adapts to the data distribution and reduces the error caused by binarization. Second, to better exploit the information in the pretrained model and restore accuracy, we propose a Cross-layer Binarization scheme and introduce learnable channel-wise scaling factors for weight binarization. The former decouples the binarization of self-attention and MLP to avoid mutual interference, while the latter enhances the representation capacity of binarized models. Overall, our method outperforms state-of-the-art methods by 19.8% on the TinyImageNet dataset. On ImageNet, BiViT achieves a competitive 70.8% Top-1 accuracy with the Swin-T model, outperforming existing SOTA methods by a clear margin.
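The two ideas in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract gives no formulas, so the fixed threshold in `threshold_binarize` and the mean-absolute-value scaling in `channelwise_binarize` are assumptions chosen for illustration (the latter is the XNOR-Net-style closed form, whereas BiViT describes its scaling factors as learnable). All function names are hypothetical. The sketch shows why sign-based binarization loses information on softmax outputs, which are all positive, and how a per-channel scale accompanies binary weights.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sign_binarize(values):
    """Conventional sign binarization to {-1, +1} (zero maps to +1)."""
    return [1.0 if v >= 0 else -1.0 for v in values]


def threshold_binarize(probs, threshold):
    """Assumed {0, 1} binarization: keep entries at or above the threshold.

    The threshold is a fixed illustrative value here; the abstract only says
    the paper's scheme adapts dynamically to the data distribution.
    """
    return [1.0 if p >= threshold else 0.0 for p in probs]


def channelwise_binarize(weight_rows):
    """Binarize each output channel (row) as (alpha, sign(w)).

    Assumption: alpha is the mean absolute value of the row, a common
    closed-form choice; a learnable alpha could start from this value.
    """
    result = []
    for row in weight_rows:
        alpha = sum(abs(w) for w in row) / len(row)
        result.append((alpha, sign_binarize(row)))
    return result


# A long-tailed attention row: one dominant score, several small ones.
attn = softmax([4.0, 1.0, 0.5, 0.0, -1.0])

# Every softmax output is positive, so sign() maps all entries to +1
# and the attention pattern is lost.
print(sign_binarize(attn))

# A {0, 1} code with a threshold keeps only the dominant entry.
print(threshold_binarize(attn, 0.5))

# Per-channel scale plus binary weights for a toy 2x2 weight matrix.
print(channelwise_binarize([[0.5, -1.5], [0.25, 0.75]]))
```

A fixed threshold is the simplest stand-in for a distribution-aware rule; a real system would choose it per layer or per head from the observed attention statistics.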

updated: Mon Nov 14 2022 03:36:38 GMT+0000 (UTC)

published: Mon Nov 14 2022 03:36:38 GMT+0000 (UTC)

arXiv
