Data-Efficient Vision Transformers for Multi-Label Disease Classification on Chest Radiographs

Finn Behrendt; Debayan Bhattacharya; Julia Krüger; Roland Opfer; Alexander Schlaefer

Radiographs are a versatile diagnostic tool for the detection and assessment of pathologies, for treatment planning, and for navigation and localization in clinical interventions. However, their interpretation and assessment by radiologists can be tedious and error-prone. A wide variety of deep learning methods have therefore been proposed to support radiologists in interpreting radiographs. Most of these approaches rely on convolutional neural networks (CNNs) to extract features from images. CNNs have proven particularly well suited to the multi-label classification of pathologies on chest radiographs (chest X-rays, CXR). In contrast, Vision Transformers (ViTs) have not been applied to this task, despite their high classification performance on generic images and their interpretable local saliency maps, which could add value to clinical interventions. ViTs rely on patch-based self-attention rather than convolutions and, unlike CNNs, carry no prior knowledge of local connectivity. While this increases model capacity, ViTs typically require very large amounts of training data, which is a hurdle in the medical domain because collecting large medical data sets is costly. In this work, we systematically compare the classification performance of ViTs and CNNs for different data set sizes and evaluate more data-efficient ViT variants (DeiT). Our results show that ViTs and CNNs perform on par, with a small benefit for ViTs, while DeiTs outperform the former if a reasonably large data set is available for training.
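As an illustration of the two ideas the abstract leans on, patch-based self-attention and multi-label classification, the following is a minimal sketch in plain Python. It is not the authors' implementation: the helper names (`split_into_patches`, `self_attention`, `multilabel_probs`), the toy 4x4 image, and the example logits are all hypothetical, and the learned query/key/value projections of a real ViT are replaced by identity mappings to keep the example short.

```python
import math


def split_into_patches(image, patch):
    """Cut a 2-D image (list of rows) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0, "image must tile evenly"
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([image[top + r][left + c]
                            for r in range(patch) for c in range(patch)])
    return patches


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product attention.

    Assumption: Q, K and V are the tokens themselves (identity projections);
    a real ViT learns separate linear projections for each.
    """
    dim = len(tokens[0])
    weights = []
    for query in tokens:
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        weights.append(softmax(scores))
    # Every output token is a weighted mix of ALL patches: no locality prior.
    outputs = [[sum(w * value[j] for w, value in zip(row, tokens))
                for j in range(dim)]
               for row in weights]
    return outputs, weights


def multilabel_probs(logits):
    """Independent sigmoid per label, so several findings can be present at once."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


# Toy 4x4 "radiograph" split into four 2x2 patches (hypothetical data).
image = [[(r * 4 + c) / 16.0 for c in range(4)] for r in range(4)]
patches = split_into_patches(image, 2)
outputs, weights = self_attention(patches)

# Hypothetical logits for three pathology labels.
probs = multilabel_probs([2.0, -1.0, 0.0])
```

Each row of `weights` spans every patch in the image, which is the sense in which self-attention has no built-in notion of local connectivity. The per-label sigmoid, unlike a softmax over labels, does not force the probabilities to sum to one, which is what makes the head multi-label.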

updated: Wed Aug 17 2022 09:07:45 GMT+0000 (UTC)

published: Wed Aug 17 2022 09:07:45 GMT+0000 (UTC)

arXiv
