Prior Knowledge-Guided Attention in Self-Supervised Vision Transformers

Kevin Miao; Akash Gokul; Raghav Singh; Suzanne Petryk; Joseph Gonzalez; Kurt Keutzer; Trevor Darrell; Colorado Reed

Recent trends in self-supervised representation learning have focused on removing inductive biases from training pipelines. However, inductive biases can be useful when only limited data are available or when they provide additional insight into the underlying data distribution. We present spatial prior attention (SPAN), a framework that takes advantage of consistent spatial and semantic structure in unlabeled image datasets to guide Vision Transformer attention. SPAN operates by regularizing attention masks from separate transformer heads to follow various priors over semantic regions. These priors can be derived from data statistics or from a single labeled sample provided by a domain expert. We study SPAN through several detailed real-world scenarios, including medical image analysis and visual quality assurance. We find that the resulting attention masks are more interpretable than those derived from domain-agnostic pretraining. SPAN produces a 58.7 mAP improvement for lung and heart segmentation. We also find that our method yields a 2.2 mAUC improvement over domain-agnostic pretraining when the pretrained model is transferred to a downstream chest disease classification task. Lastly, we show that SPAN pretraining leads to higher downstream classification performance than domain-agnostic pretraining in low-data regimes.
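The idea of regularizing per-head attention toward region priors can be illustrated with a minimal sketch. The abstract does not specify the loss, so everything below is an assumption made for illustration: the function names are hypothetical, attention is represented as a flat list of weights over image patches, and KL divergence stands in for whatever penalty the authors actually use.

```python
import math


def normalize(weights, eps=1e-8):
    """Turn non-negative weights into a strictly positive distribution."""
    total = sum(weights)
    return [(w + eps) / (total + eps * len(weights)) for w in weights]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def attention_prior_penalty(head_attention, head_priors):
    """Mean KL(prior || attention) over the heads that have a prior.

    head_attention: one list of attention weights over patches per head.
    head_priors: one prior mask over patches per head, or None to leave
                 that head unconstrained (an assumption of this sketch).
    """
    penalties = []
    for attn, prior in zip(head_attention, head_priors):
        if prior is None:
            continue  # this head is free to attend anywhere
        penalties.append(kl_divergence(normalize(prior), normalize(attn)))
    return sum(penalties) / len(penalties) if penalties else 0.0


# Toy example with 4 patches: the prior marks the first two patches
# as the region of interest (for instance, a mask from one labeled sample).
region_prior = [1, 1, 0, 0]
aligned = attention_prior_penalty([[0.5, 0.5, 0.0, 0.0]], [region_prior])
misaligned = attention_prior_penalty([[0.0, 0.0, 0.5, 0.5]], [region_prior])
```

In this toy setting, `aligned` is close to zero and `misaligned` is much larger, so adding such a term to a self-supervised objective would push the constrained head toward the prior region. In a real pipeline the penalty would be computed on differentiable attention tensors rather than Python lists.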

updated: Wed Sep 07 2022 02:30:36 GMT+0000 (UTC)

published: Wed Sep 07 2022 02:30:36 GMT+0000 (UTC)

arXiv
