Efficient Scopeformer: Towards Scalable and Rich Feature Extraction for Intracranial Hemorrhage Detection

Yassine Barhoumi; Nidhal C. Bouaynaya; Ghulam Rasool

The quality and richness of feature maps extracted by convolutional neural networks (CNNs) and vision Transformers (ViTs) directly relate to robust model performance. In medical computer vision, these information-rich features are crucial for detecting rare cases within large datasets. This work presents the "Scopeformer," a novel multi-CNN-ViT model for intracranial hemorrhage classification in computed tomography (CT) images. The Scopeformer architecture is scalable and modular, allowing various CNN architectures to be used as the backbone with diversified output features and pre-training strategies. We propose effective feature projection methods to reduce redundancies among CNN-generated features and to control the input size of ViTs. Extensive experiments with various Scopeformer models show that the model performance is proportional to the number of convolutional blocks employed in the feature extractor. Using multiple strategies, including diversifying the pre-training paradigms for CNNs, different pre-training datasets, and style transfer techniques, we demonstrate an overall improvement in the model performance at various computational budgets. We then propose smaller, compute-efficient Scopeformer versions with three different types of input and output ViT configurations. Efficient Scopeformers use four different pre-trained CNN architectures as feature extractors to increase feature richness. Our best Efficient Scopeformer model achieved an accuracy of 96.94% and a weighted logarithmic loss of 0.083 with an eightfold reduction in the number of trainable parameters compared to the base Scopeformer. Another version of the Efficient Scopeformer model further reduced the parameter space by a factor of almost 17 with negligible performance reduction. Hybrid CNNs and ViTs might provide the desired feature richness for developing accurate medical computer vision models.
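The abstract reports a weighted logarithmic loss but does not define it. The following is a minimal sketch of one common form, a weighted multi-label binary cross-entropy averaged over samples. The function name `weighted_log_loss`, the six-label layout, and the default weights (a double weight on the first label) are assumptions made for illustration and are not taken from the paper.

```python
import math

# Assumption: six binary labels per image, with the first label weighted
# twice as heavily as the others. The abstract does not state the labels
# or the weights, so treat these defaults as placeholders.
DEFAULT_WEIGHTS = (2.0, 1.0, 1.0, 1.0, 1.0, 1.0)


def weighted_log_loss(y_true, y_prob, weights=DEFAULT_WEIGHTS, eps=1e-15):
    """Weighted multi-label binary cross-entropy, averaged over samples.

    y_true: list of per-sample label lists (0 or 1).
    y_prob: list of per-sample predicted probabilities in [0, 1].
    """
    total_weight = sum(weights)
    loss_sum = 0.0
    for labels, probs in zip(y_true, y_prob):
        sample_loss = 0.0
        for y, p, w in zip(labels, probs, weights):
            # Clip probabilities so log() never receives 0.
            p = min(max(p, eps), 1.0 - eps)
            sample_loss += -w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
        # Normalize by the weight total so the result is a weighted mean.
        loss_sum += sample_loss / total_weight
    return loss_sum / len(y_true)
```

Under these assumptions, a model that predicts 0.5 for every label scores ln 2 (about 0.693) regardless of the weights, which gives a rough reference point when reading the reported value of 0.083.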

updated: Wed Feb 01 2023 03:51:27 GMT+0000 (UTC)

published: Wed Feb 01 2023 03:51:27 GMT+0000 (UTC)

arXiv
