Group Generalized Mean Pooling for Vision Transformer

Byungsoo Ko; Han-Gyu Kim; Byeongho Heo; Sangdoo Yun; Sanghyuk Chun; Geonmo Gu; Wonjae Kim


Vision Transformer (ViT) extracts its final representation from either the class token or an average of all patch tokens, following the architecture of the Transformer in Natural Language Processing (NLP) or of Convolutional Neural Networks (CNNs) in computer vision. However, studies on the best way to aggregate the patch tokens are still limited to average pooling, even though widely used pooling strategies such as max and GeM pooling could also be considered. Despite their effectiveness, existing pooling strategies take into account neither the architecture of ViT nor the channel-wise differences in the activation maps, so they aggregate crucial and trivial channels with the same importance. In this paper, we present Group Generalized Mean (GGeM) pooling as a simple yet powerful pooling strategy for ViT. GGeM divides the channels into groups and computes GeM pooling with a shared pooling parameter per group. Because ViT groups the channels via its multi-head attention mechanism, grouping the channels in GGeM leads to lower head-wise dependence while amplifying important channels in the activation maps. GGeM yields performance gains of 0.1%p to 0.7%p over the baselines and achieves state-of-the-art performance for ViT-Base and ViT-Large models on the ImageNet-1K classification task. Moreover, GGeM outperforms existing pooling strategies on image retrieval and multi-modal representation learning tasks, demonstrating its superiority across a variety of tasks. GGeM is also simple to implement, requiring only a few lines of code.
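To make the grouping idea concrete, here is a minimal sketch of the pooling step in plain Python. It is not the authors' implementation. It assumes the standard GeM form, (mean of x^p)^(1/p), applied per channel over the patch tokens, with one exponent p shared by all channels in a group. The names `gem_pool`, `ggem_pool`, and `group_p`, the split into contiguous equal-sized groups, and the clamping of inputs to a small positive `eps` are all assumptions made for illustration. In a trained model the exponents would presumably be learned, whereas here they are fixed numbers.

```python
def gem_pool(values, p, eps=1e-6):
    """Generalized mean of one channel over all tokens: (mean(x^p))^(1/p).

    Assumption: values are clamped to at least eps so that fractional
    powers stay real-valued.
    """
    clamped = [max(v, eps) for v in values]
    return (sum(v ** p for v in clamped) / len(clamped)) ** (1.0 / p)


def ggem_pool(tokens, group_p, eps=1e-6):
    """Group-wise GeM pooling over patch tokens (illustrative sketch).

    tokens:  list of N patch tokens, each a list of C channel values.
    group_p: list of G pooling exponents, one shared by each group of
             C // G contiguous channels (hypothetical layout).
    Returns a list of C pooled channel values.
    """
    num_channels = len(tokens[0])
    num_groups = len(group_p)
    if num_channels % num_groups != 0:
        raise ValueError("number of channels must be divisible by number of groups")
    group_size = num_channels // num_groups

    pooled = []
    for c in range(num_channels):
        p = group_p[c // group_size]        # exponent shared within the group
        channel = [token[c] for token in tokens]
        pooled.append(gem_pool(channel, p, eps))
    return pooled


# Two patch tokens with four channels, split into two groups of two channels.
tokens = [[1.0, 2.0, 3.0, 4.0],
          [3.0, 2.0, 1.0, 4.0]]
pooled = ggem_pool(tokens, group_p=[1.0, 3.0])
print(pooled)
```

With p = 1 a group reduces to average pooling, and as p grows the result moves toward max pooling, so a per-group exponent lets each group sit at a different point between the two.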

updated: Thu Dec 08 2022 07:13:59 GMT+0000 (UTC)

published: Thu Dec 08 2022 07:13:59 GMT+0000 (UTC)

arXiv

