Self-Promoted Supervision for Few-Shot Transformer

Bowen Dong; Pan Zhou; Shuicheng Yan; Wangmeng Zuo

The few-shot learning ability of vision transformers (ViTs) is highly desirable but rarely investigated. In this work, we empirically find that, within the same few-shot learning framework, e.g., Meta-Baseline, replacing the widely used CNN feature extractor with a ViT model often severely impairs few-shot classification performance. Moreover, our empirical study shows that, in the absence of inductive bias, ViTs often learn low-quality token dependencies under the few-shot learning regime, where only a few labeled training samples are available, which largely contributes to the above performance degradation. To alleviate this issue, we propose, for the first time, a simple yet effective few-shot training framework for ViTs, namely Self-promoted sUpervisioN (SUN). Specifically, besides the conventional global supervision for global semantic learning, SUN further pretrains the ViT on the few-shot learning dataset and then uses it to generate individual location-specific supervision for guiding each patch token. This location-specific supervision tells the ViT which patch tokens are similar or dissimilar and thus accelerates token dependency learning. Moreover, it models the local semantics in each patch token to improve object grounding and recognition capability, which helps learn generalizable patterns. To improve the quality of location-specific supervision, we further propose two techniques: 1) background patch filtration, which filters out background patches and assigns them to an extra background class; and 2) spatial-consistent augmentation, which introduces sufficient diversity for data augmentation while preserving the accuracy of the generated local supervision. Experimental results show that SUN using ViTs significantly surpasses other few-shot learning frameworks with ViTs and is the first to achieve higher performance than CNN-based state-of-the-art methods.
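The idea of location-specific supervision with background patch filtration can be illustrated with a minimal sketch. The abstract does not say how background patches are identified, so the confidence threshold below is an assumption made only for illustration, and the names `location_specific_labels` and `bg_threshold` are hypothetical. The per-patch logits stand in for the output of the pretrained teacher ViT.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def location_specific_labels(patch_logits, bg_threshold=0.5):
    """Turn per-patch teacher logits into per-patch soft labels.

    patch_logits: one list of class logits per patch token (assumed to come
    from a ViT pretrained on the few-shot training set).
    Returns labels over num_classes + 1 entries; the last entry is the
    extra background class.

    ASSUMPTION: a patch counts as background when the teacher's top
    probability is below bg_threshold. The abstract does not specify
    the actual filtration rule.
    """
    labels = []
    for logits in patch_logits:
        probs = softmax(logits)
        if max(probs) < bg_threshold:
            # Low-confidence patch: assign it entirely to the background class.
            labels.append([0.0] * len(probs) + [1.0])
        else:
            # Confident patch: keep the teacher's soft label, zero background.
            labels.append(probs + [0.0])
    return labels


# Two patches, three foreground classes: one confident, one near-uniform.
teacher_logits = [[4.0, 0.0, 0.0], [0.1, 0.0, 0.1]]
labels = location_specific_labels(teacher_logits)
```

In this toy input the first patch keeps its soft label over the three foreground classes, while the second, near-uniform patch is routed to the background class. A student ViT would then be trained with a per-patch loss against these labels in addition to the usual global classification loss.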

updated: Thu Jun 09 2022 05:12:46 GMT+0000 (UTC)

published: Mon Mar 14 2022 12:53:27 GMT+0000 (UTC)

arXiv
