Combiner: Full Attention Transformer with Sparse Computation Cost

Hongyu Ren; Hanjun Dai; Zihang Dai; Mengjiao Yang; Jure Leskovec; Dale Schuurmans; Bo Dai

Transformers provide a class of expressive architectures that are extremely effective for sequence modeling. However, the key limitation of transformers is their quadratic memory and time complexity O(L^2) with respect to the sequence length L in attention layers, which restricts their application to extremely long sequences. Most existing approaches leverage sparsity or low-rank assumptions in the attention matrix to reduce cost, but sacrifice expressiveness. Instead, we propose Combiner, which provides full attention capability in each attention head while maintaining low computation and memory complexity. The key idea is to treat the self-attention mechanism as a conditional expectation over embeddings at each location, and to approximate the conditional distribution with a structured factorization. Each location can attend to all other locations, either via direct attention or through indirect attention to abstractions, which are themselves conditional expectations of embeddings from corresponding local regions. We show that most sparse attention patterns used in existing sparse transformers can inspire the design of such a factorization for full attention, resulting in the same sub-quadratic cost (O(L log L) or O(L√L)). Combiner is a drop-in replacement for attention layers in existing transformers and can be easily implemented in common frameworks. An experimental evaluation on both autoregressive and bidirectional sequence tasks demonstrates the effectiveness of this approach, yielding state-of-the-art results on several image and text modeling tasks.
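The conditional-expectation view in the abstract can be made concrete with a small sketch. The code below is not the authors' implementation. It is a minimal, bidirectional, two-level block factorization written in plain Python, and several choices are assumptions made only for illustration: the helper names, the use of a mean key as each block's summary key, and the use of a mean query to define the distribution inside each block. It shows that a query can score only its own block plus one summary per other block and still place nonzero probability on every position.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_vec(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def full_attention_weights(q, keys):
    """p(j | i) for standard attention: one score per position, O(L) per query."""
    return softmax([dot(q, k) for k in keys])


def factorized_attention_weights(i, queries, keys, block):
    """Effective p(j | i) under a two-level block factorization (illustrative).

    Returns the weights over all positions and the number of scores
    the query at position i had to evaluate.
    """
    L = len(keys)
    blocks = [range(s, min(s + block, L)) for s in range(0, L, block)]
    own = i // block
    q = queries[i]

    scores, targets = [], []
    # Direct attention: every position in the query's own block.
    for j in blocks[own]:
        scores.append(dot(q, keys[j]))
        targets.append(("direct", j))
    # Indirect attention: one abstraction per other block.
    for b, idx in enumerate(blocks):
        if b != own:
            summary_key = mean_vec([keys[j] for j in idx])  # assumption
            scores.append(dot(q, summary_key))
            targets.append(("block", b))
    top = softmax(scores)

    weights = [0.0] * L
    for p, (kind, ref) in zip(top, targets):
        if kind == "direct":
            weights[ref] = p
        else:
            idx = blocks[ref]
            # Distribution inside the block; it does not depend on i,
            # so in practice it would be computed once per block.
            summary_query = mean_vec([queries[j] for j in idx])  # assumption
            local = softmax([dot(summary_query, keys[j]) for j in idx])
            for j, p_local in zip(idx, local):
                weights[j] = p * p_local
    return weights, len(scores)


# Toy data: L = 16 positions, block size sqrt(L) = 4.
rng = random.Random(0)
L, d, block = 16, 4, 4
queries = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(L)]
keys = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(L)]

weights, n_scores = factorized_attention_weights(5, queries, keys, block)
print(f"scores evaluated: {n_scores} (full attention would need {L})")
print(f"weights sum to {sum(weights):.6f}, min weight {min(weights):.3e}")
```

With a block size near √L, each query evaluates about 2√L scores, and the per-block distributions cost O(L) in total, which is where an O(L√L) budget comes from in this toy setting. The attention output at position i would then be the weighted sum of value vectors under these weights, that is, the conditional expectation the abstract describes.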

updated: Mon Jul 12 2021 22:43:11 GMT+0000 (UTC)

published: Mon Jul 12 2021 22:43:11 GMT+0000 (UTC)

arXiv
