Wave-ViT: Unifying Wavelet and Transformers for Visual Representation Learning

Ting Yao; Yingwei Pan; Yehao Li; Chong-Wah Ngo; Tao Mei

Multi-scale Vision Transformer (ViT) has emerged as a powerful backbone for computer vision tasks, but the self-attention computation in Transformer scales quadratically w.r.t. the number of input patches. Thus, existing solutions commonly employ down-sampling operations (e.g., average pooling) over keys/values to dramatically reduce the computational cost. In this work, we argue that such an over-aggressive down-sampling design is not invertible and inevitably causes information loss, especially for high-frequency components in objects (e.g., texture details). Motivated by wavelet theory, we construct a new Wavelet Vision Transformer (Wave-ViT) that formulates invertible down-sampling with wavelet transforms and self-attention learning in a unified way. This proposal enables self-attention learning with lossless down-sampling over keys/values, facilitating the pursuit of a better efficiency-vs-accuracy trade-off. Furthermore, inverse wavelet transforms are leveraged to strengthen self-attention outputs by aggregating local contexts with an enlarged receptive field. We validate the superiority of Wave-ViT through extensive experiments over multiple vision tasks (e.g., image recognition, object detection and instance segmentation). Its performance surpasses that of state-of-the-art ViT backbones with comparable FLOPs. Source code is available at https://github.com/YehLi/ImageNetModel.
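The contrast between average pooling and invertible wavelet down-sampling can be illustrated with a small sketch. It is not the authors' implementation: it assumes the Haar wavelet (the abstract does not say which wavelet is used), works on plain nested lists rather than key/value tensors, and the function names `haar_dwt2`, `haar_idwt2` and `avg_pool2` are hypothetical.

```python
def _zeros(h, w):
    return [[0.0] * w for _ in range(h)]


def haar_dwt2(x):
    """One level of a 2D Haar transform on an even-sized 2D list.

    Returns four half-resolution subbands (ll, lh, hl, hh).
    """
    h, w = len(x), len(x[0])
    ll, lh, hl, hh = (_zeros(h // 2, w // 2) for _ in range(4))
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            a, b = x[i][j], x[i][j + 1]
            c, d = x[i + 1][j], x[i + 1][j + 1]
            r, s = i // 2, j // 2
            ll[r][s] = (a + b + c + d) / 2  # low-frequency average (scaled)
            lh[r][s] = (a - b + c - d) / 2  # left-vs-right differences
            hl[r][s] = (a + b - c - d) / 2  # top-vs-bottom differences
            hh[r][s] = (a - b - c + d) / 2  # diagonal differences
    return ll, lh, hl, hh


def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: rebuilds the full-resolution 2D list."""
    h, w = len(ll), len(ll[0])
    x = _zeros(2 * h, 2 * w)
    for r in range(h):
        for s in range(w):
            p, q, u, v = ll[r][s], lh[r][s], hl[r][s], hh[r][s]
            x[2 * r][2 * s] = (p + q + u + v) / 2
            x[2 * r][2 * s + 1] = (p - q + u - v) / 2
            x[2 * r + 1][2 * s] = (p + q - u - v) / 2
            x[2 * r + 1][2 * s + 1] = (p - q - u + v) / 2
    return x


def avg_pool2(x):
    """2x2 average pooling with stride 2 (keeps only the low-frequency part)."""
    ll, _, _, _ = haar_dwt2(x)
    return [[v / 2 for v in row] for row in ll]


# Two patches with the same mean but different texture.
flat = [[2, 2], [2, 2]]
textured = [[0, 4], [4, 0]]

# Average pooling maps both patches to the same value, so it cannot be undone.
same_after_pooling = avg_pool2(flat) == avg_pool2(textured)

# The Haar subbands differ, and the inverse transform recovers the input.
recovered = haar_idwt2(*haar_dwt2(textured))
```

Each of the four subbands has half the spatial resolution of the input, so together they hold exactly as many values as the original. Average pooling corresponds to keeping only a scaled copy of `ll` and discarding the three detail subbands, which is where high-frequency information such as texture is lost.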

updated: Mon Jul 11 2022 16:03:51 GMT+0000 (UTC)

published: Mon Jul 11 2022 16:03:51 GMT+0000 (UTC)

arXiv
