Skip-Attention: Improving Vision Transformers by Paying Less Attention

Shashanka Venkataramanan; Amir Ghodrati; Yuki M. Asano; Fatih Porikli; Amirhossein Habibian

This work aims to improve the efficiency of vision transformers (ViTs). While ViTs use computationally expensive self-attention operations in every layer, we identify that these operations are highly correlated across layers, a key redundancy that causes unnecessary computation. Based on this observation, we propose SkipAt, a method that reuses self-attention computation from preceding layers to approximate attention at one or more subsequent layers. To ensure that reusing self-attention blocks across layers does not degrade performance, we introduce a simple parametric function; with it, SkipAt outperforms the baseline transformer while running faster. We show the effectiveness of our method in image classification and self-supervised learning on ImageNet-1K, semantic segmentation on ADE20K, image denoising on SIDD, and video denoising on DAVIS. We achieve improved throughput at the same or higher accuracy in all these tasks.
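The reuse idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract only says "a simple parametric function", so the single linear map `w_phi` used here is an assumption, as are all names (`self_attention`, `skip_attention`, `rand_matrix`) and the toy sizes. Residual connections, MLP blocks, and multi-head attention are omitted for brevity.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, wq, wk, wv):
    """Standard single-head self-attention; the score matrix is n_tokens x n_tokens."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    scale = 1.0 / math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]  # transpose of k
    scores = matmul(q, k_t)
    weights = [softmax([s * scale for s in row]) for row in scores]
    return matmul(weights, v)


def skip_attention(prev_attn_out, w_phi):
    """Approximate a layer's attention output from the previous layer's output.

    Assumption: the parametric function is a single linear map (w_phi).
    No query-key score matrix is computed here.
    """
    return matmul(prev_attn_out, w_phi)


def rand_matrix(rows, cols, rng):
    """Hypothetical helper: small random matrix for the toy example."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n_tokens, dim = 4, 8
x = rand_matrix(n_tokens, dim, rng)
wq, wk, wv = (rand_matrix(dim, dim, rng) for _ in range(3))
w_phi = rand_matrix(dim, dim, rng)

attn_l = self_attention(x, wq, wk, wv)      # layer l: full self-attention
attn_next = skip_attention(attn_l, w_phi)   # layer l+1: reuse layer l's output
```

In this sketch the skipped layer costs one token-wise matrix product instead of the quadratic query-key product, which is the kind of saving the abstract describes. Whether accuracy holds depends on how the parametric function is designed and trained, which the abstract does not detail.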

updated: Tue Jan 17 2023 16:17:57 GMT+0000 (UTC)

published: Thu Jan 05 2023 18:59:52 GMT+0000 (UTC)

arXiv
