An Attention Free Transformer

Shuangfei Zhai; Walter Talbott; Nitish Srivastava; Chen Huang; Hanlin Goh; Ruixiang Zhang; Josh Susskind

We introduce the Attention Free Transformer (AFT), an efficient variant of the Transformer that eliminates the need for dot-product self-attention. In an AFT layer, the key and value are first combined with a set of learned position biases, and the result is then multiplied with the query element-wise. This new operation has memory complexity that is linear in both the context size and the feature dimension, making it compatible with both large inputs and large models. We also introduce AFT-local and AFT-conv, two model variants that exploit locality and spatial weight sharing while maintaining global connectivity. We conduct extensive experiments on two autoregressive modeling tasks (CIFAR10 and Enwik8) and on an image recognition task (ImageNet-1K classification). We show that AFT achieves competitive performance on all of these benchmarks while remaining highly efficient.
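The layer described in the abstract can be illustrated with a minimal NumPy sketch. Several details here are assumptions that the abstract does not state: the query is passed through a sigmoid gate, the keys and position biases are combined by adding them inside an exponential, and the result is normalized over context positions. The function name `aft_full` and the argument names `q`, `k`, `v`, `w` are hypothetical, chosen only for illustration.

```python
import numpy as np


def aft_full(q, k, v, w):
    """Sketch of an AFT-style layer (assumed formulation, not a reference one).

    q, k, v: arrays of shape (T, d) for context size T and feature size d.
    w:       array of shape (T, T), the learned pairwise position biases.

    Computes, for each position t and each feature independently,
        y[t] = sigmoid(q[t]) * sum_s(exp(k[s] + w[t, s]) * v[s])
                             / sum_s(exp(k[s] + w[t, s]))
    """
    # Shift before exponentiating for numerical stability. The per-row shift
    # of w and the per-feature shift of k both cancel in the ratio below.
    exp_w = np.exp(w - w.max(axis=1, keepdims=True))   # (T, T)
    exp_k = np.exp(k - k.max(axis=0, keepdims=True))   # (T, d)

    # Combine keys and values with the position biases. No (T, T, d) tensor
    # and no query-key dot product is ever formed.
    numerator = exp_w @ (exp_k * v)                    # (T, d)
    denominator = exp_w @ exp_k                        # (T, d)

    # Element-wise gating by the query.
    gate = 1.0 / (1.0 + np.exp(-q))                    # sigmoid, (T, d)
    return gate * numerator / denominator


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, d = 6, 4
    q, k, v = (rng.normal(size=(T, d)) for _ in range(3))
    w = rng.normal(size=(T, T))
    print(aft_full(q, k, v, w).shape)
```

In this sketch the intermediate activations are all of shape (T, d), which is one way to read the linear-memory claim; the bias table `w` itself is a (T, T) parameter here, and the AFT-local and AFT-conv variants mentioned in the abstract are not covered.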

updated: Fri May 28 2021 20:45:30 GMT+0000 (UTC)

published: Fri May 28 2021 20:45:30 GMT+0000 (UTC)

arXiv
