Efficient Training of Visual Transformers with Small-Size Datasets

Yahui Liu; Enver Sangineto; Wei Bi; Nicu Sebe; Bruno Lepri; Marco De Nadai

Visual Transformers (VTs) are emerging as an architectural paradigm alternative to convolutional networks (CNNs). Unlike CNNs, VTs can capture global relations between image elements, and they potentially have a larger representation capacity. However, the lack of the typical convolutional inductive bias makes these models more data-hungry than common CNNs. In fact, some local properties of the visual domain that are embedded in the CNN architectural design must be learned from samples in VTs. In this paper, we empirically analyse different VTs, comparing their robustness in a small training-set regime, and we show that, despite having comparable accuracy when trained on ImageNet, their performance on smaller datasets can differ widely. Moreover, we propose a self-supervised task that can extract additional information from images with only a negligible computational overhead. This task encourages the VTs to learn spatial relations within an image and makes VT training much more robust when training data are scarce. Our task is used jointly with the standard (supervised) training and does not depend on specific architectural choices, so it can be easily plugged into existing VTs. Using an extensive evaluation with different VTs and datasets, we show that our method can improve (sometimes dramatically) the final accuracy of the VTs. The code will be available upon acceptance.
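
The abstract does not spell out the form of the self-supervised task, so the following is only a minimal sketch of one way an auxiliary spatial-relation loss could be added to a supervised loss. It assumes (this is not stated above) that the task samples pairs of token embeddings from the final grid and regresses their normalized relative offset with an L1 penalty. The names `spatial_loss`, `linear_predictor`, `joint_loss`, and the weight `lam` are hypothetical, and the linear map stands in for whatever small prediction head a real model would use.

```python
import random

def sample_pairs(grid_size, num_pairs, rng):
    """Sample random pairs of token positions on a grid_size x grid_size grid."""
    pairs = []
    for _ in range(num_pairs):
        a = (rng.randrange(grid_size), rng.randrange(grid_size))
        b = (rng.randrange(grid_size), rng.randrange(grid_size))
        pairs.append((a, b))
    return pairs

def relative_offset(a, b, grid_size):
    """Regression target: offset between two positions, normalized by grid size."""
    return ((a[0] - b[0]) / grid_size, (a[1] - b[1]) / grid_size)

def linear_predictor(emb_a, emb_b, weights):
    """Hypothetical prediction head: a linear map from the concatenated
    pair of embeddings to a 2-D offset (an assumption, for illustration)."""
    x = emb_a + emb_b  # list concatenation of the two embeddings
    return tuple(sum(w * v for w, v in zip(row, x)) for row in weights)

def spatial_loss(tokens, weights, num_pairs, rng):
    """Mean L1 error between predicted and true normalized offsets."""
    grid_size = len(tokens)
    total = 0.0
    for a, b in sample_pairs(grid_size, num_pairs, rng):
        target = relative_offset(a, b, grid_size)
        pred = linear_predictor(tokens[a[0]][a[1]], tokens[b[0]][b[1]], weights)
        total += abs(pred[0] - target[0]) + abs(pred[1] - target[1])
    return total / num_pairs

def joint_loss(supervised_loss, tokens, weights, lam=0.1, num_pairs=16, rng=None):
    """Supervised loss plus the weighted auxiliary term (lam is an assumed value)."""
    rng = rng or random.Random(0)
    return supervised_loss + lam * spatial_loss(tokens, weights, num_pairs, rng)
```

Because the auxiliary term only reads the final token embeddings, a loss of this shape does not depend on how the backbone produces them, which is consistent with the claim that the task can be plugged into different VTs.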

updated: Mon Jun 07 2021 16:14:06 GMT+0000 (UTC)

published: Mon Jun 07 2021 16:14:06 GMT+0000 (UTC)

arXiv

