Efficient Training of Visual Transformers with Small Datasets

Yahui Liu; Enver Sangineto; Wei Bi; Nicu Sebe; Bruno Lepri; Marco De Nadai

Visual Transformers (VTs) are emerging as an architectural paradigm alternative to convolutional networks (CNNs). Unlike CNNs, VTs can capture global relations between image elements, and they potentially have a larger representation capacity. However, lacking the typical convolutional inductive bias, these models are more data-hungry than common CNNs: some local properties of the visual domain that are embedded in the CNN architectural design must instead be learned from samples in VTs. In this paper, we empirically analyse different VTs, comparing their robustness in a small training-set regime, and we show that, despite having comparable accuracy when trained on ImageNet, their performance on smaller datasets can differ substantially. Moreover, we propose a self-supervised task that can extract additional information from images with only a negligible computational overhead. This task encourages the VTs to learn spatial relations within an image and makes VT training much more robust when training data are scarce. Our task is used jointly with the standard (supervised) training and does not depend on specific architectural choices, so it can be easily plugged into existing VTs. Using an extensive evaluation with different VTs and datasets, we show that our method can improve (sometimes dramatically) the final accuracy of the VTs. Our code is available at: https://github.com/yhlleo/VTs-Drloc.
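The abstract does not spell out the self-supervised task, so the following is only a minimal sketch of one way an auxiliary spatial-relation loss could be combined with a supervised loss. It assumes the VT's final tokens form a square grid, that a small predictor guesses the normalized offset between two sampled tokens, and that the two losses are summed with a weight. The function names, the L1 error, and the `aux_weight` value are all hypothetical choices made for illustration; they are not taken from the paper or its repository.

```python
import random


def sample_pair_targets(grid_size, num_pairs, rng):
    """Sample random pairs of grid positions with their normalized offsets.

    Hypothetical helper: the target for a pair is the row/column
    displacement divided by the grid size.
    """
    pairs = []
    for _ in range(num_pairs):
        i, j = rng.randrange(grid_size), rng.randrange(grid_size)
        p, h = rng.randrange(grid_size), rng.randrange(grid_size)
        target = ((i - p) / grid_size, (j - h) / grid_size)
        pairs.append(((i, j), (p, h), target))
    return pairs


def localization_loss(tokens, predict_offset, pairs):
    """Mean L1 error between predicted and true offsets (assumed loss form)."""
    total = 0.0
    for (i, j), (p, h), (t_row, t_col) in pairs:
        d_row, d_col = predict_offset(tokens[i][j], tokens[p][h])
        total += abs(d_row - t_row) + abs(d_col - t_col)
    return total / len(pairs)


def joint_loss(supervised_loss, aux_loss, aux_weight=0.1):
    """Supervised loss plus weighted auxiliary loss; the weight is illustrative."""
    return supervised_loss + aux_weight * aux_loss


if __name__ == "__main__":
    rng = random.Random(0)
    k = 7  # assumed side length of the final token grid
    # Toy "embeddings" that happen to encode their own grid position.
    tokens = [[(i / k, j / k) for j in range(k)] for i in range(k)]
    # Toy predictor: subtracts the two embeddings component-wise.
    predictor = lambda a, b: (a[0] - b[0], a[1] - b[1])
    pairs = sample_pair_targets(k, num_pairs=32, rng=rng)
    aux = localization_loss(tokens, predictor, pairs)
    print(joint_loss(supervised_loss=1.0, aux_loss=aux))
```

In a real model the toy predictor would be replaced by a small learned network applied to pairs of token embeddings, and the auxiliary term would be backpropagated together with the classification loss. Because only the final token grid is touched, a loss of this shape does not depend on the backbone architecture.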

updated: Sun Nov 14 2021 21:16:01 GMT+0000 (UTC)

published: Mon Jun 07 2021 16:14:06 GMT+0000 (UTC)

arXiv
