A comparative study between vision transformers and CNNs in digital pathology

Luca Deininger; Bernhard Stimpel; Anil Yuce; Samaneh Abbasi-Sureshjani; Simon Schönenberger; Paolo Ocampo; Konstanty Korski; Fabien Gaire

Vision transformers were recently shown to be capable of outperforming convolutional neural networks when pretrained on sufficient amounts of data. Compared with convolutional neural networks, vision transformers have a weaker inductive bias and therefore allow more flexible feature detection. Motivated by this, this work explores vision transformers for tumor detection in digital pathology whole-slide images in four tissue types, and for tissue type identification. We compared the patch-wise classification performance of the vision transformer DeiT-Tiny to that of the state-of-the-art convolutional neural network ResNet18. Because annotated whole-slide images are scarce, we further compared both models pretrained on large amounts of unlabeled whole-slide images using state-of-the-art self-supervised approaches. The results show that the vision transformer performed slightly better than the ResNet18 for tumor detection in three of the four tissue types, while the ResNet18 performed slightly better for the remaining tasks. The aggregated slide-level predictions of both models were correlated, indicating that the models captured similar imaging features. Altogether, the vision transformer models performed on par with the ResNet18 while requiring more effort to train. To surpass the performance of convolutional neural networks, vision transformers might require more challenging tasks to benefit from their weak inductive bias.
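The slide-level comparison mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state how patch predictions were aggregated or which correlation measure was used, so the mean over patches and the Pearson coefficient below are assumptions, and all function names and the toy data are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def aggregate_by_slide(patch_preds):
    """Average patch-level tumor probabilities per slide.

    patch_preds: iterable of (slide_id, probability) pairs.
    Mean aggregation is an assumption, not the paper's stated method.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for slide_id, prob in patch_preds:
        sums[slide_id] += prob
        counts[slide_id] += 1
    return {sid: sums[sid] / counts[sid] for sid in sums}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def slide_level_correlation(preds_a, preds_b):
    """Correlate two models' aggregated predictions over shared slides."""
    agg_a = aggregate_by_slide(preds_a)
    agg_b = aggregate_by_slide(preds_b)
    slides = sorted(agg_a.keys() & agg_b.keys())
    return pearson([agg_a[s] for s in slides], [agg_b[s] for s in slides])


# Toy patch predictions (made up for illustration only).
vit_preds = [("s1", 0.9), ("s1", 0.7), ("s2", 0.2), ("s2", 0.4),
             ("s3", 0.5), ("s3", 0.7)]
cnn_preds = [("s1", 0.8), ("s1", 0.8), ("s2", 0.1), ("s2", 0.3),
             ("s3", 0.6), ("s3", 0.6)]

r = slide_level_correlation(vit_preds, cnn_preds)
print(f"slide-level Pearson r = {r:.3f}")
```

A high coefficient on real data would be consistent with the abstract's reading that both models respond to similar image features, although correlation alone does not show that the learned features are the same.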

updated: Wed Jun 01 2022 10:41:11 GMT+0000 (UTC)

published: Wed Jun 01 2022 10:41:11 GMT+0000 (UTC)

arXiv
