Spatial Entropy as an Inductive Bias for Vision Transformers

Yahui Liu; Enver Sangineto; Yajing Chen; Linchao Bao; Haoxian Zhang; Nicu Sebe; Bruno Lepri; Marco De Nadai

Recent work on Vision Transformers (VTs) has shown that introducing a local inductive bias in the VT architecture helps reduce the number of samples necessary for training. However, these architectural modifications reduce the generality of the Transformer backbone, which partially contradicts the push towards uniform architectures shared, e.g., by both Computer Vision and Natural Language Processing. In this work, we propose a different and complementary direction, in which a local bias is introduced using an auxiliary self-supervised task, performed jointly with standard supervised training. Specifically, we exploit the observation that the attention maps of VTs, when trained with self-supervision, can contain a semantic segmentation structure which does not spontaneously emerge when training is supervised. Thus, we explicitly encourage the emergence of this spatial clustering as a form of training regularization. In more detail, we exploit the assumption that, in a given image, objects usually correspond to a small number of connected regions, and we propose a spatial formulation of the information entropy to quantify this object-based inductive bias. By minimizing the proposed spatial entropy, we include an additional self-supervised signal during training. Using extensive experiments, we show that the proposed regularization leads to equivalent or better results than other VT proposals which include a local bias by changing the basic Transformer architecture, and that it can drastically boost the final accuracy of VTs when using small to medium-sized training sets. The code is available at https://github.com/helia95/SAR.
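To make the idea of an entropy over connected regions concrete, the following is a minimal sketch of one possible reading of the abstract, not the authors' implementation (see the linked repository for that). It assumes a single 2D attention map given as a list of rows, thresholds it at its mean, groups above-threshold cells into 8-connected regions, and computes the Shannon entropy of the attention mass distributed over those regions. The function names, the mean threshold, the 8-connectivity, and the mass weighting are all assumptions made for illustration. The sketch is also not differentiable, so it illustrates the quantity rather than a usable training loss.

```python
import math
from collections import deque


def connected_components(mask):
    """Return the 8-connected regions of True cells in a 2D boolean grid."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    components = []
    for i in range(h):
        for j in range(w):
            if not mask[i][j] or seen[i][j]:
                continue
            # Breadth-first flood fill starting from an unvisited True cell.
            queue, cells = deque([(i, j)]), []
            seen[i][j] = True
            while queue:
                y, x = queue.popleft()
                cells.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(cells)
    return components


def spatial_entropy(attn):
    """Entropy of above-mean attention mass over connected regions.

    Assumption: cells above the map's mean are "active"; each region's
    probability is its share of the total above-mean mass.
    """
    values = [v for row in attn for v in row]
    mean = sum(values) / len(values)
    mask = [[v > mean for v in row] for row in attn]
    regions = connected_components(mask)
    masses = [sum(attn[y][x] - mean for y, x in cells) for cells in regions]
    total = sum(masses)
    if total <= 0:
        return 0.0  # flat map: no active region, entropy defined as zero here
    return -sum((m / total) * math.log(m / total) for m in masses if m > 0)
```

Under these assumptions, a map whose high-attention cells form one compact blob has entropy zero, while a map with two separate, equally strong blobs has entropy log 2. Minimizing such a quantity therefore favors attention concentrated in a small number of connected regions, which is the object-based bias the abstract describes.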

updated: Fri Mar 03 2023 10:37:00 GMT+0000 (UTC)

published: Mon Oct 03 2022 11:57:30 GMT+0000 (UTC)

arXiv
