Vision Transformer with Quadrangle Attention

Qiming Zhang; Jing Zhang; Yufei Xu; Dacheng Tao

Window-based attention has become a popular choice in vision transformers due to its superior performance, lower computational complexity, and smaller memory footprint. However, hand-crafted windows are data-agnostic, which constrains the flexibility of transformers to adapt to objects of varying sizes, shapes, and orientations. To address this issue, we propose a novel quadrangle attention (QA) method that extends window-based attention to a general quadrangle formulation. Our method employs an end-to-end learnable quadrangle regression module that predicts a transformation matrix to transform default windows into target quadrangles for token sampling and attention calculation. This enables the network to model targets of different shapes and orientations and to capture rich context information. We integrate QA into plain and hierarchical vision transformers to create a new architecture named QFormer, which requires only minor code modifications and incurs negligible extra computational cost. Extensive experiments on public benchmarks demonstrate that QFormer outperforms existing representative vision transformers on various vision tasks, including classification, object detection, semantic segmentation, and pose estimation. The code will be made publicly available at https://github.com/ViTAE-Transformer/QFormer.
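To make the idea of transforming a default window into a target quadrangle concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the abstract only says that a transformation matrix is predicted, so the decomposition into scale, shear, rotation, translation, and projection, the order of composition, and all function and parameter names (`window_transform`, `window_grid`, `quadrangle_points`) are hypothetical. In a real network the parameters would be regressed from the window's features, and features would then be sampled at the mapped locations, which this sketch does not show.

```python
import math


def matmul3(a, b):
    """Multiply two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def window_transform(scale=(1.0, 1.0), shear=0.0, angle=0.0,
                     shift=(0.0, 0.0), proj=(0.0, 0.0)):
    """Compose a 3x3 projective matrix from hypothetical per-window parameters.

    Assumed order: scale, then shear, rotation, translation, projection.
    In a network these values would come from a learned regression layer.
    """
    c, s = math.cos(angle), math.sin(angle)
    steps = [
        [[scale[0], 0.0, 0.0], [0.0, scale[1], 0.0], [0.0, 0.0, 1.0]],
        [[1.0, shear, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
        [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]],
        [[1.0, 0.0, shift[0]], [0.0, 1.0, shift[1]], [0.0, 0.0, 1.0]],
        [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [proj[0], proj[1], 1.0]],
    ]
    matrix = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    for step in steps:
        matrix = matmul3(step, matrix)  # later steps are applied on the left
    return matrix


def window_grid(top, left, size):
    """Centers of the size x size tokens in a default square window."""
    return [(left + j + 0.5, top + i + 0.5)
            for i in range(size) for j in range(size)]


def quadrangle_points(matrix, points):
    """Map token centers into the target quadrangle around the window center."""
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    mapped = []
    for x, y in points:
        u, v = x - cx, y - cy  # coordinates relative to the window center
        xh = matrix[0][0] * u + matrix[0][1] * v + matrix[0][2]
        yh = matrix[1][0] * u + matrix[1][1] * v + matrix[1][2]
        w = matrix[2][0] * u + matrix[2][1] * v + matrix[2][2]
        mapped.append((xh / w + cx, yh / w + cy))  # assumes w != 0
    return mapped


default = window_grid(top=0, left=0, size=2)
identity = quadrangle_points(window_transform(), default)
doubled = quadrangle_points(window_transform(scale=(2.0, 2.0)), default)
rotated = quadrangle_points(window_transform(angle=math.pi / 2), default)
```

With the default parameters the matrix is the identity, so the sampling locations coincide with the original window and the scheme reduces to ordinary window attention. Scaling by 2 moves the four token centers of the 2x2 window to the corners of the square spanning (0, 0) to (2, 2), while a nonzero `angle`, `shear`, or `proj` turns the square into a rotated, slanted, or general four-sided region.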

updated: Mon Mar 27 2023 11:13:50 GMT+0000 (UTC)

published: Mon Mar 27 2023 11:13:50 GMT+0000 (UTC)

arXiv
