Attribute Surrogates Learning and Spectral Tokens Pooling in Transformers for Few-shot Learning

Yangji He; Weihan Liang; Dongyang Zhao; Hong-Yu Zhou; Weifeng Ge; Yizhou Yu; Wenqiang Zhang

This paper presents hierarchically cascaded transformers that improve data efficiency through attribute surrogates learning and spectral tokens pooling. Vision transformers have recently emerged as a promising alternative to convolutional neural networks for visual recognition. However, when training data is insufficient, they overfit and perform poorly. To improve data efficiency, we propose hierarchically cascaded transformers that exploit intrinsic image structure through spectral tokens pooling and optimize the learnable parameters through latent attribute surrogates. Spectral tokens pooling uses the intrinsic image structure to reduce the ambiguity between foreground content and background noise. The attribute surrogates learning scheme is designed to benefit from the rich visual information in image-label pairs rather than only the simple visual concepts assigned by their labels. Our hierarchically cascaded transformers, called HCTransformers, are built upon the self-supervised learning framework DINO and are tested on several popular few-shot learning benchmarks. In the inductive setting, HCTransformers surpass the DINO baseline by a large margin of 9.7% in 5-way 1-shot accuracy and 9.17% in 5-way 5-shot accuracy on miniImageNet, which demonstrates that HCTransformers extract discriminative features efficiently. HCTransformers also show clear advantages over state-of-the-art (SOTA) few-shot classification methods in both the 5-way 1-shot and 5-way 5-shot settings on four popular benchmark datasets: miniImageNet, tieredImageNet, FC100, and CIFAR-FS. The trained weights and code are available at https://github.com/StomachCold/HCTransformers.
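For readers unfamiliar with the "5-way 1-shot" terminology, the following is a minimal sketch of how accuracy on one such episode could be computed from precomputed feature vectors. It is not the authors' evaluation code. The abstract does not say which classifier is applied to the extracted features, so the cosine nearest-centroid rule is an assumption, as are the hypothetical helper names `evaluate_episode` and `make_toy_episode` and the synthetic features that stand in for backbone outputs.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def evaluate_episode(support, support_labels, query, query_labels):
    """Nearest-centroid accuracy for one N-way K-shot episode (assumed classifier)."""
    # One prototype per class: the mean of its K support feature vectors.
    grouped = {}
    for feat, label in zip(support, support_labels):
        grouped.setdefault(label, []).append(feat)
    prototypes = {
        label: [sum(col) / len(feats) for col in zip(*feats)]
        for label, feats in grouped.items()
    }
    # Assign each query to the class whose prototype is most cosine-similar.
    correct = 0
    for feat, label in zip(query, query_labels):
        predicted = max(prototypes, key=lambda c: cosine(feat, prototypes[c]))
        correct += predicted == label
    return correct / len(query)


def make_toy_episode(rng, n_way=5, k_shot=1, n_query=15, dim=32, noise=0.1):
    """Synthetic stand-in for backbone features: one noisy cluster per class."""
    centers = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_way)]

    def sample(per_class):
        feats, labels = [], []
        for label, center in enumerate(centers):
            for _ in range(per_class):
                feats.append([x + noise * rng.gauss(0, 1) for x in center])
                labels.append(label)
        return feats, labels

    support, support_labels = sample(k_shot)
    query, query_labels = sample(n_query)
    return support, support_labels, query, query_labels


# Average accuracy over several randomly drawn 5-way 1-shot toy episodes.
rng = random.Random(0)
accuracies = [evaluate_episode(*make_toy_episode(rng)) for _ in range(20)]
mean_accuracy = sum(accuracies) / len(accuracies)
print(f"5-way 1-shot accuracy over {len(accuracies)} toy episodes: {mean_accuracy:.3f}")
```

Setting `k_shot=5` in `make_toy_episode` gives the 5-way 5-shot variant. In a real evaluation the synthetic vectors would be replaced by features extracted from images of classes unseen during training, and accuracy would typically be averaged over many more episodes.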

updated: Thu Mar 17 2022 03:49:58 GMT+0000 (UTC)

published: Thu Mar 17 2022 03:49:58 GMT+0000 (UTC)

arXiv

