SiT: Self-supervised vIsion Transformer

Sara Atito; Muhammad Awais; Josef Kittler

Self-supervised learning methods are gaining traction in computer vision because they have recently narrowed the gap with supervised learning. In natural language processing (NLP), self-supervised learning and transformers are already the methods of choice. Recent literature suggests that transformers are becoming increasingly popular in computer vision as well. So far, vision transformers have been shown to work well when pretrained either on large-scale supervised data or with some form of co-supervision, e.g. from a teacher network. These supervised pretrained vision transformers achieve very good results in downstream tasks with minimal changes. In this work we investigate the merits of self-supervised learning for pretraining image/vision transformers and then using them for downstream classification tasks. We propose Self-supervised vIsion Transformers (SiT) and discuss several self-supervised training mechanisms to obtain a pretext model. The architectural flexibility of SiT allows us to use it as an autoencoder and to work with multiple self-supervised tasks seamlessly. We show that a pretrained SiT can be finetuned for a downstream classification task on small-scale datasets consisting of a few thousand images rather than several million. The proposed approach is evaluated on standard datasets using common protocols. The results demonstrate the strength of transformers and their suitability for self-supervised learning. The approach outperforms existing self-supervised learning methods by a large margin. We also observe that SiT performs well in few-shot learning, and we show that it learns useful representations by simply training a linear classifier on top of the features learned by SiT. Pretraining, finetuning, and evaluation code will be available at: https://github.com/Sara-Ahmed/SiT.

updated: Sun Nov 14 2021 15:17:23 GMT+0000 (UTC)

published: Thu Apr 08 2021 08:34:04 GMT+0000 (UTC)

arXiv
