Vision Transformer for Efficient Chest X-ray and Gastrointestinal Image Classification

Smriti Regmi; Aliza Subedi; Ulas Bagci; Debesh Jha

Medical image analysis is an active research topic because of its value in clinical applications such as early disease diagnosis and treatment. Convolutional neural networks (CNNs) have become the de facto standard for medical image analysis because they can learn complex features from the available datasets, and they surpass human performance on many image-understanding tasks. Transformer architectures have also gained popularity for medical image analysis. Despite this progress, there is still room for improvement. This study applies several CNN and transformer-based methods, together with a wide range of data augmentation techniques, and evaluates them on three medical image datasets from different modalities. We compare the performance of the vision transformer (ViT) model with state-of-the-art (SOTA) pre-trained CNNs. On the Chest X-ray dataset, our ViT model achieved the highest F1 score of 0.9532, with a recall of 0.9533, a Matthews correlation coefficient (MCC) of 0.9259, and an ROC-AUC score of 0.97. Similarly, on the Kvasir dataset, we achieved an F1 score of 0.9436, a recall of 0.9437, an MCC of 0.9360, and an ROC-AUC score of 0.97. On Kvasir-Capsule, a large-scale video capsule endoscopy (VCE) dataset, our ViT model achieved a weighted F1 score of 0.7156, a recall of 0.7182, an MCC of 0.3705, and an ROC-AUC score of 0.57. We found that our transformer-based models performed better than various CNN models at classifying different anatomical structures, findings, and abnormalities. These improvements over the CNN-based approaches suggest that the model could serve as a new benchmark for algorithm development.
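
The abstract reports weighted F1, recall, and MCC side by side, and on Kvasir-Capsule these diverge noticeably (0.7156 and 0.7182 against 0.3705). The following is a minimal sketch of how such metrics can be computed from a confusion matrix, assuming support-weighted averaging for F1 and recall and the standard multiclass generalization of MCC. The function name `classification_metrics` and the toy confusion matrix are illustrative assumptions, not the authors' code or data.

```python
from math import sqrt


def classification_metrics(confusion):
    """Compute weighted F1, weighted recall, and multiclass MCC.

    confusion[i][j] is the number of samples whose true class is i
    and whose predicted class is j (hypothetical input format).
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    true_counts = [sum(confusion[i]) for i in range(n)]
    pred_counts = [sum(confusion[i][j] for i in range(n)) for j in range(n)]
    correct = sum(confusion[k][k] for k in range(n))

    weighted_f1 = 0.0
    weighted_recall = 0.0
    for k in range(n):
        tp = confusion[k][k]
        recall = tp / true_counts[k] if true_counts[k] else 0.0
        precision = tp / pred_counts[k] if pred_counts[k] else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        weight = true_counts[k] / total  # class support as the weight
        weighted_f1 += weight * f1
        weighted_recall += weight * recall

    # Multiclass MCC written in terms of the confusion matrix marginals.
    cov_tp = correct * total - sum(p * t for p, t in zip(pred_counts, true_counts))
    cov_pp = total ** 2 - sum(p * p for p in pred_counts)
    cov_tt = total ** 2 - sum(t * t for t in true_counts)
    mcc_denom = sqrt(cov_pp * cov_tt)
    mcc = cov_tp / mcc_denom if mcc_denom else 0.0  # 0.0 when undefined

    return {"weighted_f1": weighted_f1,
            "weighted_recall": weighted_recall,
            "mcc": mcc}


# Toy, made-up example: a classifier that always predicts the majority class
# on a 90/10 imbalanced two-class problem.
majority_only = classification_metrics([[90, 0], [10, 0]])
perfect = classification_metrics([[5, 0], [0, 5]])
```

In the toy majority-class case, weighted recall and weighted F1 stay high because the dominant class carries most of the weight, while MCC falls to zero. A gap of this kind is typical on imbalanced data, which may be one reason the Kvasir-Capsule MCC sits well below its weighted F1, although the abstract itself does not state the cause.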

updated: Sun Apr 23 2023 04:07:03 GMT+0000 (UTC)

published: Sun Apr 23 2023 04:07:03 GMT+0000 (UTC)

arXiv
