FIT: Far-reaching Interleaved Transformers

Ting Chen; Lala Li

We present FIT, a transformer-based architecture with efficient self-attention and adaptive computation. Unlike the original transformer, which operates on a single sequence of data tokens, FIT divides the data tokens into groups, each of which is a shorter sequence of tokens. We employ two types of transformer layers: local layers operate on data tokens within each group, while global layers operate on a smaller set of introduced latent tokens. These layers, built from the same self-attention and feed-forward layers as standard transformers, are interleaved, and cross-attention is used to exchange information between the data tokens and latent tokens of the same group. The attention complexity is O(n^2) locally within each group of size n, but can reach O(L^{4/3}) globally for a sequence of length L. Efficiency can be further improved by relying more on global layers, which perform adaptive computation using a smaller set of latent tokens. FIT is a versatile architecture and can function as an encoder, diffusion decoder, or autoregressive decoder. We provide initial evidence demonstrating its effectiveness in high-resolution image understanding and generation tasks. Notably, FIT shows potential for end-to-end training on gigabit-scale data, such as 6400×6400 images, or 160K tokens (after patch tokenization), within a memory capacity of 16GB, without requiring specific optimizations or model parallelism.
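The abstract states the O(n^2) and O(L^{4/3}) figures without showing where they come from. The sketch below is one way to arrive at them, under assumptions that are not in the abstract: cost is counted as query-key pairs per attention layer, every group carries the same small number of latent tokens (`latents_per_group`, a hypothetical parameter), and the group size is chosen so that local and global costs balance. The function names are illustrative and not taken from the paper's code.

```python
def attention_pair_counts(seq_len, group_size, latents_per_group=1):
    """Count query-key pairs per attention layer type (illustrative only).

    Assumes seq_len data tokens are split into equal groups of group_size,
    and each group owns latents_per_group latent tokens.
    """
    if seq_len % group_size:
        raise ValueError("seq_len must be divisible by group_size")
    num_groups = seq_len // group_size
    num_latents = num_groups * latents_per_group
    return {
        # Local layers: self-attention inside each group, n^2 per group.
        "local": num_groups * group_size ** 2,
        # Cross-attention: latents and data tokens of the same group (one direction).
        "cross": num_groups * group_size * latents_per_group,
        # Global layers: self-attention over all latent tokens.
        "global": num_latents ** 2,
        # Baseline: full self-attention over the whole sequence.
        "full": seq_len ** 2,
    }


def balanced_group_size(seq_len, latents_per_group=1):
    """Group size n that equalizes local cost L*n and global cost (L*m/n)^2.

    Solving L*n = (L*m/n)^2 gives n = (L * m^2)^(1/3), so both costs
    scale as L^(4/3) when m is a constant.
    """
    return round((seq_len * latents_per_group ** 2) ** (1.0 / 3.0))


seq_len = 4096
group_size = balanced_group_size(seq_len)            # 16
counts = attention_pair_counts(seq_len, group_size)
print(group_size, counts)
```

With L = 4096 and one latent per group, the balanced group size is 16, and local and global attention each cost 65,536 pairs, which equals 4096^{4/3}, compared with 16,777,216 pairs for full self-attention over the same sequence.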

updated: Thu May 25 2023 16:27:30 GMT+0000 (UTC)

published: Mon May 22 2023 03:56:44 GMT+0000 (UTC)

arXiv
