Spatial Entropy as an Inductive Bias for Vision Transformers

Elia Peruzzo; Enver Sangineto; Yahui Liu; Marco De Nadai; Wei Bi; Bruno Lepri; Nicu Sebe

Recent work has shown that the attention maps of Vision Transformers (VTs), when trained with self-supervision, can contain a semantic segmentation structure that does not spontaneously emerge when training is supervised. In this paper, we explicitly encourage the emergence of this spatial clustering as a form of training regularization, thereby adding a self-supervised pretext task to standard supervised learning. Specifically, we propose a VT regularization method based on a spatial formulation of information entropy. By minimizing the proposed spatial entropy, we explicitly ask the VT to produce spatially ordered attention maps, which introduces an object-based prior during training. Through extensive experiments, we show that the proposed regularization approach is beneficial across different training scenarios, datasets, downstream tasks and VT architectures. The code will be available upon acceptance.
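The abstract does not define the spatial entropy, so the following is only an illustrative sketch of one plausible reading, not the authors' formulation. It assumes that an attention map is thresholded at its mean, that the above-threshold cells are grouped into 8-connected regions, and that the entropy is taken over the share of attention mass in each region. The function names `connected_components` and `spatial_entropy` are hypothetical. Under these assumptions, a map with one compact region scores lower than a map whose attention is scattered over several regions.

```python
import math
from collections import deque


def connected_components(mask):
    """Return one list of (row, col) cells per 8-connected region of True cells."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill starting from an unvisited True cell.
            queue = deque([(r, c)])
            seen[r][c] = True
            cells = []
            while queue:
                y, x = queue.popleft()
                cells.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(cells)
    return components


def spatial_entropy(attn):
    """Entropy over the attention mass of each above-mean connected region.

    Assumption (not taken from the abstract): threshold at the map's mean
    and use each region's summed attention as its unnormalized probability.
    """
    values = [v for row in attn for v in row]
    mean = sum(values) / len(values)
    mask = [[v > mean for v in row] for row in attn]
    masses = [sum(attn[y][x] for y, x in cells)
              for cells in connected_components(mask)]
    total = sum(masses)
    if total == 0:
        return 0.0  # no cell exceeds the mean, e.g. a constant map
    return -sum((m / total) * math.log(m / total) for m in masses if m > 0)


# One compact 2x2 region versus four isolated corner cells.
compact = [[0, 0, 0, 0],
           [0, 1, 1, 0],
           [0, 1, 1, 0],
           [0, 0, 0, 0]]
scattered = [[1, 0, 0, 1],
             [0, 0, 0, 0],
             [0, 0, 0, 0],
             [1, 0, 0, 1]]
print(spatial_entropy(compact), spatial_entropy(scattered))
```

As written, this sketch is not differentiable (the thresholding and region labelling are discrete), so it only illustrates the quantity being measured; using it as a training loss would require a formulation through which gradients can flow.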

updated: Tue Oct 04 2022 10:20:12 GMT+0000 (UTC)

published: Thu Jun 09 2022 17:34:39 GMT+0000 (UTC)

arXiv
