BiFormer: Vision Transformer with Bi-Level Routing Attention

Lei Zhu; Xinjiang Wang; Zhanghan Ke; Wayne Zhang; Rynson Lau

As the core building block of vision transformers, attention is a powerful tool for capturing long-range dependencies. However, such power comes at a cost: it incurs a huge computation burden and a heavy memory footprint, because pairwise token interactions across all spatial locations are computed. A series of works attempt to alleviate this problem by introducing handcrafted and content-agnostic sparsity into attention, such as restricting the attention operation to local windows, axial stripes, or dilated windows. In contrast to these approaches, we propose a novel dynamic sparse attention via bi-level routing to enable a more flexible, content-aware allocation of computation. Specifically, for a query, irrelevant key-value pairs are first filtered out at a coarse region level, and then fine-grained token-to-token attention is applied in the union of the remaining candidate regions (i.e., the routed regions). We provide a simple yet effective implementation of the proposed bi-level routing attention, which exploits the sparsity to save both computation and memory while involving only GPU-friendly dense matrix multiplications. Built with the proposed bi-level routing attention, a new general vision transformer, named BiFormer, is then presented. As BiFormer attends to a small subset of relevant tokens in a query-adaptive manner without distraction from irrelevant ones, it enjoys both good performance and high computational efficiency, especially in dense prediction tasks. Empirical results across several computer vision tasks, such as image classification, object detection, and semantic segmentation, verify the effectiveness of our design. Code is available at https://github.com/rayleizhu/BiFormer.
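The two-stage procedure described in the abstract (coarse region-level filtering, then fine token-to-token attention over the routed regions) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation; for that, see the repository linked above. Several details here are assumptions made for illustration: tokens arrive already grouped into regions, each region is summarized by the mean of its tokens, routing keeps the `topk` highest-affinity regions, and the function name `bi_level_routing_attention` and its parameters are hypothetical. Multi-head splitting and learned projections are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def bi_level_routing_attention(q, k, v, topk):
    """Illustrative sketch only. q, k, v: (num_regions, tokens_per_region, dim)."""
    n_regions, n_tokens, dim = q.shape

    # Coarse level: one descriptor per region (assumption: mean over its tokens).
    q_region = q.mean(axis=1)             # (R, d)
    k_region = k.mean(axis=1)             # (R, d)
    affinity = q_region @ k_region.T      # (R, R) region-to-region relevance

    # Routing: for each query region keep the topk most relevant key regions.
    routed = np.argsort(-affinity, axis=1)[:, :topk]   # (R, topk)

    # Gather keys/values of the routed regions into dense tensors:
    # (R, topk, T, d) -> (R, topk * T, d).
    k_gathered = k[routed].reshape(n_regions, topk * n_tokens, dim)
    v_gathered = v[routed].reshape(n_regions, topk * n_tokens, dim)

    # Fine level: dense token-to-token attention inside the routed regions.
    scores = q @ k_gathered.transpose(0, 2, 1) / np.sqrt(dim)  # (R, T, topk*T)
    weights = softmax(scores, axis=-1)
    return weights @ v_gathered, routed   # output has shape (R, T, d)


# Usage: 4 regions of 16 tokens each, 8 channels, each region routed to 2 regions.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((4, 16, 8)) for _ in range(3))
out, routed = bi_level_routing_attention(q, k, v, topk=2)
print(out.shape, routed.shape)
```

In this sketch every step after the top-k selection is a dense batched matrix multiplication over gathered tensors, which matches the abstract's point about GPU-friendly operations. Setting `topk` equal to the number of regions makes every query attend to every token, so the result reduces to ordinary full attention.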

updated: Wed Mar 15 2023 17:58:46 GMT+0000 (UTC)

published: Wed Mar 15 2023 17:58:46 GMT+0000 (UTC)

arXiv
