Problems and shortcuts in deep learning for screening mammography

Trevor Tsue; Brent Mombourquette; Ahmed Taha; Thomas Paul Matthews; Yen Nhi Truong Vu; Jason Su

This work reveals previously unrecognized challenges in the performance and generalizability of deep learning models. We (1) identify spurious shortcuts and evaluation issues that can inflate performance and (2) propose training and analysis methods to address them. We trained an AI model to classify cancer on a retrospective dataset of 120,112 US exams (3,467 cancers) acquired from 2008 to 2017 and 16,693 UK exams (5,655 cancers) acquired from 2011 to 2015. We evaluated on a screening mammography test set of 11,593 US exams (102 cancers; 7,594 women; age 57.1 ± 11.0) and 1,880 UK exams (590 cancers; 1,745 women; age 63.3 ± 7.2). A model trained on images containing only view markers (no breast) achieved an AUC of 0.691, well above the chance level of 0.5. The original model trained on both datasets achieved an AUC of 0.945 on the combined US+UK dataset but, paradoxically, only 0.838 and 0.892 on the US and UK datasets, respectively. Sampling cancers equally from both datasets during training mitigated this shortcut. A similar AUC paradox occurred for exam type: the AUC was 0.903 when diagnostic and screening exams were evaluated together, but only 0.862 and 0.861 when they were evaluated separately. Removing diagnostic exams during training alleviated this bias. Finally, the model did not exhibit the AUC paradox across scanner models but still exhibited a bias toward Selenia Dimension (SD) over Hologic Selenia (HS) exams. Analysis showed that the AUC paradox occurred when some values of a dataset attribute had a higher cancer prevalence (dataset bias) and the model consequently assigned a higher probability to exams with these attribute values (model bias). Stratifying by attribute and balancing cancer prevalence can mitigate shortcuts during evaluation. Dataset and model bias can introduce shortcuts and the AUC paradox, both of which are potentially pervasive in healthcare AI. Our methods can verify and mitigate shortcuts while providing a clear understanding of performance.
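The AUC paradox described in the abstract can be reproduced on synthetic data. The sketch below is illustrative only and is not the authors' code. It assumes two hypothetical sites, A (low cancer prevalence) and B (high cancer prevalence), and a hypothetical model whose scores carry the same cancer signal at both sites plus a constant upward shift for site B, standing in for the model bias. All names and numbers (`simulate_site`, `site_shift`, exam counts) are made up for the example. The last step, subsampling site A negatives so both sites share one prevalence, is only one possible reading of "balancing cancer prevalence" during evaluation.

```python
import numpy as np


def auc(scores, labels):
    """AUC from the Mann-Whitney U statistic, using average ranks for ties."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    _, inverse, counts = np.unique(scores, return_inverse=True, return_counts=True)
    # Average 1-based rank of each distinct score value.
    avg_rank = np.cumsum(counts) - (counts - 1) / 2.0
    ranks = avg_rank[inverse]
    n_pos = int(labels.sum())
    n_neg = labels.size - n_pos
    return (ranks[labels].sum() - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


def simulate_site(rng, n_exams, n_cancers, site_shift, signal=1.5):
    """Hypothetical scores: Gaussian noise + cancer signal + per-site shift."""
    labels = np.zeros(n_exams, dtype=bool)
    labels[:n_cancers] = True
    scores = rng.normal(loc=signal * labels + site_shift, scale=1.0)
    return scores, labels


rng = np.random.default_rng(0)

# Site A: 1% prevalence, no shift. Site B: 30% prevalence, scores shifted up.
scores_a, labels_a = simulate_site(rng, 10_000, 100, site_shift=0.0)
scores_b, labels_b = simulate_site(rng, 2_000, 600, site_shift=2.0)

# Stratified evaluation: one AUC per site.
auc_a = auc(scores_a, labels_a)
auc_b = auc(scores_b, labels_b)

# Pooled evaluation: most cancers come from B and most negatives from A,
# so the site shift alone separates many cancer/non-cancer pairs.
auc_combined = auc(
    np.concatenate([scores_a, scores_b]),
    np.concatenate([labels_a, labels_b]),
)

# Prevalence-balanced evaluation (an assumed variant): subsample site A
# negatives so that site A has the same cancer prevalence as site B.
prev_b = labels_b.mean()
n_neg_keep = int(round(labels_a.sum() * (1.0 - prev_b) / prev_b))
neg_keep = rng.choice(np.flatnonzero(~labels_a), size=n_neg_keep, replace=False)
keep_a = np.concatenate([np.flatnonzero(labels_a), neg_keep])
auc_balanced = auc(
    np.concatenate([scores_a[keep_a], scores_b]),
    np.concatenate([labels_a[keep_a], labels_b]),
)

print(f"site A: {auc_a:.3f}  site B: {auc_b:.3f}")
print(f"pooled: {auc_combined:.3f}  prevalence-balanced: {auc_balanced:.3f}")
```

In this toy setup the pooled AUC exceeds both per-site AUCs even though the model is no better at detecting cancer within either site, which is the pattern the abstract calls the AUC paradox. Once prevalence is equalized, the site shift no longer helps separate cancers from non-cancers, so the pooled figure stops being inflated.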

updated: Wed Mar 29 2023 02:50:59 GMT+0000 (UTC)

published: Wed Mar 29 2023 02:50:59 GMT+0000 (UTC)

arXiv
