Continual Transformers: Redundancy-Free Attention for Online Inference

Lukas Hedegaard; Arian Bakhtiarnia; Alexandros Iosifidis

Transformers in their common form are inherently limited to operating on whole token sequences rather than on one token at a time. Consequently, their use during online inference on time-series data entails considerable redundancy due to the overlap in successive token sequences. In this work, we propose novel formulations of the Scaled Dot-Product Attention, which enable Transformers to perform efficient online token-by-token inference on a continual input stream. Importantly, our modifications are purely to the order of computations, while the outputs and learned weights are identical to those of the original Transformer Encoder. We validate our Continual Transformer Encoder with experiments on the THUMOS14, TVSeries and GTZAN datasets with remarkable results: our Continual one- and two-block architectures reduce the floating-point operations per prediction by up to 63x and 2.6x, respectively, while retaining predictive performance.
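The redundancy described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it is a single-head, sliding-window attention in plain Python that caches key and value projections so each new token is projected only once, and it checks that the output for the newest token matches a baseline that re-projects the whole window at every step. The class name `SlidingSingleOutputAttention`, the helper functions, the window size, and the random weights are all assumptions made for illustration; positional encodings, multiple heads, and the rest of the encoder block are omitted.

```python
import math
import random
from collections import deque


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def attend(q, keys, values):
    """Scaled dot-product attention output for a single query vector."""
    scale = 1.0 / math.sqrt(len(q))
    weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
    d_v = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d_v)]


def full_window_last_output(tokens, Wq, Wk, Wv):
    """Baseline: re-project every token in the window, return the newest output."""
    keys = [matvec(Wk, x) for x in tokens]
    values = [matvec(Wv, x) for x in tokens]
    return attend(matvec(Wq, tokens[-1]), keys, values)


class SlidingSingleOutputAttention:
    """Hypothetical online layer: caches K/V so each token is projected once."""

    def __init__(self, Wq, Wk, Wv, window):
        self.Wq, self.Wk, self.Wv = Wq, Wk, Wv
        # deque(maxlen=...) drops the oldest entry once the window is full.
        self.keys = deque(maxlen=window)
        self.values = deque(maxlen=window)

    def step(self, x):
        """Consume one token and return the attention output for it."""
        self.keys.append(matvec(self.Wk, x))
        self.values.append(matvec(self.Wv, x))
        return attend(matvec(self.Wq, x), self.keys, self.values)


def max_abs_diff_demo(d=4, window=5, steps=12, seed=0):
    """Largest gap between the online and the recomputed output on a random stream."""
    rng = random.Random(seed)

    def rand_mat():
        return [[rng.gauss(0, 1) for _ in range(d)] for _ in range(d)]

    Wq, Wk, Wv = rand_mat(), rand_mat(), rand_mat()
    stream = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(steps)]
    layer = SlidingSingleOutputAttention(Wq, Wk, Wv, window)
    worst = 0.0
    for t, x in enumerate(stream):
        online = layer.step(x)
        window_tokens = stream[max(0, t - window + 1): t + 1]
        ref = full_window_last_output(window_tokens, Wq, Wk, Wv)
        worst = max(worst, max(abs(a - b) for a, b in zip(online, ref)))
    return worst


if __name__ == "__main__":
    print(max_abs_diff_demo())
```

In this sketch the per-step cost is one set of projections plus one query against the cached window, whereas the baseline repeats the key and value projections for every token in the window at every step. The same weights are used in both paths, which mirrors the abstract's point that only the order of computation changes.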

updated: Mon Nov 07 2022 07:56:35 GMT+0000 (UTC)

published: Mon Jan 17 2022 08:20:09 GMT+0000 (UTC)

arXiv
