Adaptive Fourier Neural Operators: Efficient Token Mixers for Transformers

John Guibas; Morteza Mardani; Zongyi Li; Andrew Tao; Anima Anandkumar; Bryan Catanzaro

Vision transformers have delivered tremendous success in representation learning, primarily due to effective token mixing through self-attention. However, self-attention scales quadratically with the number of pixels, which becomes infeasible for high-resolution inputs. To cope with this challenge, we propose the Adaptive Fourier Neural Operator (AFNO), an efficient token mixer that learns to mix in the Fourier domain. AFNO is based on a principled foundation of operator learning, which allows us to frame token mixing as a continuous global convolution without any dependence on the input resolution. This principle was previously used to design FNO, which solves global convolution efficiently in the Fourier domain and has shown promise in learning challenging PDEs. To handle challenges in visual representation learning such as discontinuities in images and high-resolution inputs, we propose principled architectural modifications to FNO that result in memory and computational efficiency. These include imposing a block-diagonal structure on the channel mixing weights, adaptively sharing weights across tokens, and sparsifying the frequency modes via soft-thresholding and shrinkage. The resulting model is highly parallel, with quasi-linear complexity and memory that is linear in the sequence size. AFNO outperforms self-attention mechanisms for few-shot segmentation in terms of both efficiency and accuracy. For Cityscapes segmentation with the Segformer-B3 backbone, AFNO can handle a sequence size of 65k and outperforms other efficient self-attention mechanisms.
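The three modifications named in the abstract (block-diagonal channel mixing, weights shared across tokens, and soft-thresholding of frequency modes) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The abstract does not specify the details, so the two-layer per-block mixing, the ReLU applied separately to real and imaginary parts, the residual connection, and the names `afno_mixer`, `soft_threshold`, `w1`, `w2`, and `lam` are all assumptions made for illustration.

```python
import numpy as np


def soft_threshold(z, lam):
    """Shrink the magnitude of each complex entry by lam, keeping its phase.

    Entries whose magnitude is at most lam become exactly zero.
    """
    mag = np.abs(z)
    scale = np.maximum(mag - lam, 0.0) / np.maximum(mag, 1e-12)
    return z * scale


def afno_mixer(x, w1, w2, lam=0.01):
    """Hypothetical AFNO-style token mixer (illustrative assumptions throughout).

    x      : real array of shape (H, W, D), a grid of H*W tokens with D channels
    w1, w2 : complex arrays of shape (K, D // K, D // K), one weight matrix per
             channel block; the same weights are reused for every frequency mode
    lam    : soft-thresholding level used to sparsify the frequency modes
    """
    h, w, d = x.shape
    k = w1.shape[0]

    # Token mixing: 2D FFT over the two spatial axes.
    z = np.fft.rfft2(x, axes=(0, 1), norm="ortho")      # (H, W // 2 + 1, D)

    # Block-diagonal channel mixing: split D channels into K independent blocks.
    z = z.reshape(h, z.shape[1], k, d // k)
    hidden = np.einsum("hwki,kio->hwko", z, w1)
    # Assumed nonlinearity: ReLU on real and imaginary parts separately.
    hidden = np.maximum(hidden.real, 0.0) + 1j * np.maximum(hidden.imag, 0.0)
    out = np.einsum("hwki,kio->hwko", hidden, w2)

    # Sparsify frequency modes via soft-thresholding (shrinkage).
    out = soft_threshold(out, lam).reshape(h, -1, d)

    # Back to the token domain, plus an assumed residual connection.
    y = np.fft.irfft2(out, s=(h, w), axes=(0, 1), norm="ortho")
    return x + y
```

Under these assumptions, the FFT contributes the quasi-linear cost in the number of tokens, while the block-diagonal weights keep the channel-mixing cost and parameter count below those of a dense D-by-D matrix. The function can be called on, for example, an 8 x 8 token grid with 16 channels and K = 4 blocks, and it returns a real array of the same shape as its input.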

updated: Wed Nov 24 2021 05:44:31 GMT+0000 (UTC)

published: Wed Nov 24 2021 05:44:31 GMT+0000 (UTC)

arXiv
