Vision Transformers with Hierarchical Attention

Yun Liu; Yu-Huan Wu; Guolei Sun; Le Zhang; Ajad Chhatkuli; Luc Van Gool

This paper addresses the inefficiency of vision transformers caused by the high computational and space complexity of Multi-Head Self-Attention (MHSA). To this end, we propose Hierarchical MHSA (H-MHSA), whose representation is computed in a hierarchical manner. Specifically, we first divide the input image into patches, as is commonly done, and treat each patch as a token. H-MHSA then learns token relationships within local patches, which serves as local relationship modeling. Next, the small patches are merged into larger ones, and H-MHSA models the global dependencies among the small number of merged tokens. Finally, the local and global attentive features are aggregated to obtain features with powerful representation capacity. Because attention is computed over only a limited number of tokens at each step, the computational load is reduced dramatically. Hence, H-MHSA can efficiently model global relationships among tokens without sacrificing fine-grained information. With the H-MHSA module incorporated, we build a family of Hierarchical-Attention-based Transformer Networks, namely HAT-Net. To demonstrate the superiority of HAT-Net in scene understanding, we conduct extensive experiments on fundamental vision tasks, including image classification, semantic segmentation, object detection, and instance segmentation. HAT-Net thus provides a new perspective for vision transformers. Code and pretrained models are available at https://github.com/yun-liu/HAT-Net.
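To make the efficiency argument concrete, the following is a rough sketch that counts how many query-key pairs are scored by plain self-attention versus a two-step local-then-global scheme of the kind the abstract describes. It is not taken from the HAT-Net code: the function names, the `window` and `merge` parameters, and the 56x56 token grid are all assumptions chosen for illustration, and the exact formulation in the paper may differ.

```python
def full_attention_pairs(height: int, width: int) -> int:
    """Query-key pairs scored by plain self-attention over all tokens."""
    n = height * width
    return n * n


def hierarchical_attention_pairs(height: int, width: int, window: int, merge: int) -> int:
    """Pairs scored by a local-then-global scheme (illustrative assumption)."""
    if height % window or width % window or height % merge or width % merge:
        raise ValueError("window and merge must divide both height and width")
    # Local step: attention is restricted to each window x window group of tokens.
    num_windows = (height // window) * (width // window)
    local_pairs = num_windows * (window * window) ** 2
    # Global step: each merge x merge block becomes one token, and attention
    # is computed only among these merged tokens.
    merged_tokens = (height // merge) * (width // merge)
    global_pairs = merged_tokens ** 2
    return local_pairs + global_pairs


if __name__ == "__main__":
    # Assumed setting: a 56x56 token grid (for example, a 224x224 image
    # split into 4x4-pixel patches), with hypothetical window/merge sizes.
    h = w = 56
    full = full_attention_pairs(h, w)
    hier = hierarchical_attention_pairs(h, w, window=8, merge=8)
    print(f"full: {full}, hierarchical: {hier}, ratio: {full / hier:.1f}x")
```

Under these assumed sizes, the local term grows linearly with the number of tokens and the global term is quadratic only in the much smaller number of merged tokens, which is the source of the savings the abstract claims.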

updated: Wed Jun 15 2022 15:15:28 GMT+0000 (UTC)

published: Sun Jun 06 2021 17:01:13 GMT+0000 (UTC)

arXiv
