Dynamic Grained Encoder for Vision Transformers

Lin Song; Songyang Zhang; Songtao Liu; Zeming Li; Xuming He; Hongbin Sun; Jian Sun; Nanning Zheng

Transformers, the de facto standard for language modeling, have recently been applied to vision tasks. This paper introduces sparse queries for vision transformers to exploit the intrinsic spatial redundancy of natural images and reduce computational cost. Specifically, we propose a Dynamic Grained Encoder for vision transformers, which adaptively assigns a suitable number of queries to each spatial region. It thus achieves a fine-grained representation in discriminative regions while remaining highly efficient. In addition, the Dynamic Grained Encoder is compatible with most vision transformer frameworks. Without bells and whistles, our encoder allows state-of-the-art vision transformers to reduce computational complexity by 40% to 60% while maintaining comparable performance on image classification. Extensive experiments on object detection and segmentation further demonstrate the generalizability of our approach. Code is available at https://github.com/StevenGrove/vtpack.
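
The idea of assigning a different number of queries to each spatial region can be illustrated with a small toy sketch. It is not the authors' implementation (see the repository linked above for that). Several things here are assumptions made for illustration: the feature map holds one scalar per position, the candidate granularities are 1, 2 and 4, and the gate is a simple variance threshold standing in for whatever learned mechanism the encoder actually uses. The function names `split_regions`, `choose_granularity`, `pool_queries` and `sparse_queries` are hypothetical.

```python
from statistics import pvariance


def split_regions(image, region):
    """Yield (row, col, block) for non-overlapping region x region blocks.

    Assumes the image height and width are multiples of `region`.
    """
    height, width = len(image), len(image[0])
    for r in range(0, height, region):
        for c in range(0, width, region):
            yield r, c, [row[c:c + region] for row in image[r:r + region]]


def choose_granularity(block, thresholds=(0.01, 0.1)):
    """Hypothetical gate: pick a pooling window size from the block variance.

    Low variance (redundant region) -> large window -> few queries.
    High variance (detailed region) -> window of 1 -> one query per position.
    """
    variance = pvariance([v for row in block for v in row])
    if variance < thresholds[0]:
        return 4
    if variance < thresholds[1]:
        return 2
    return 1


def pool_queries(block, g):
    """Average-pool the block with g x g windows; each window is one query."""
    n = len(block)
    queries = []
    for r in range(0, n, g):
        for c in range(0, n, g):
            window = [block[i][j] for i in range(r, r + g) for j in range(c, c + g)]
            queries.append(sum(window) / len(window))
    return queries


def sparse_queries(image, region=4):
    """Return a list of (row, col, granularity, queries), one entry per region."""
    result = []
    for r, c, block in split_regions(image, region):
        g = choose_granularity(block)
        result.append((r, c, g, pool_queries(block, g)))
    return result


if __name__ == "__main__":
    # Left 4x4 region is flat; right 4x4 region is a checkerboard.
    image = [[0.5] * 4 + [float((i + j) % 2) for j in range(4)] for i in range(4)]
    for r, c, g, queries in sparse_queries(image):
        print(f"region at ({r},{c}): granularity {g}, {len(queries)} queries")
```

In this example the flat region collapses to a single query while the checkerboard region keeps all 16, so the encoder would process 17 queries instead of the 32 a dense layout would need. Under these assumptions, the saving comes entirely from how many regions the gate judges to be redundant.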

updated: Tue Jan 10 2023 07:55:29 GMT+0000 (UTC)

published: Tue Jan 10 2023 07:55:29 GMT+0000 (UTC)

arXiv
