Disparities in Dermatology AI Performance on a Diverse, Curated Clinical Image Set

Roxana Daneshjou; Kailas Vodrahalli; Roberto A Novoa; Melissa Jenkins; Weixin Liang; Veronica Rotemberg; Justin Ko; Susan M Swetter; Elizabeth E Bailey; Olivier Gevaert; Pritam Mukherjee; Michelle Phung; Kiana Yekrang; Bradley Fong; Rachna Sahasrabudhe; Johan A. C. Allerup; Utako Okata-Karigane; James Zou; Albert Chiou

Access to dermatological care is a major issue, with an estimated 3 billion people lacking access to care globally. Artificial intelligence (AI) may aid in triaging skin diseases. However, most AI models have not been rigorously assessed on images of diverse skin tones or uncommon diseases. To ascertain potential biases in algorithm performance in this context, we curated the Diverse Dermatology Images (DDI) dataset, the first publicly available, expertly curated, and pathologically confirmed image dataset with diverse skin tones. Using this dataset of 656 images, we show that state-of-the-art dermatology AI models perform substantially worse on DDI, with the area under the receiver operating characteristic curve (ROC-AUC) dropping by 27-36 percent compared to the models' original test results. All the models performed worse on dark skin tones and uncommon diseases, which are represented in the DDI dataset. Additionally, we find that dermatologists, who typically provide visual labels for AI training and test datasets, also perform worse on images of dark skin tones and uncommon diseases compared to ground truth biopsy annotations. Finally, fine-tuning AI models on the well-characterized and diverse DDI images closed the performance gap between light and dark skin tones. Moreover, algorithms fine-tuned on diverse skin tones outperformed dermatologists at identifying malignancy on images of dark skin tones. Our findings identify important weaknesses and biases in dermatology AI that need to be addressed to ensure reliable application to diverse patients and diseases.
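As a rough illustration of the kind of subgroup comparison the abstract describes, the sketch below computes ROC-AUC separately for each skin-tone group and reports the gap between them. It is not the authors' evaluation code: the function names `roc_auc` and `auc_by_group`, the group labels "light" and "dark", and all labels and scores are hypothetical, made-up values chosen only to make the arithmetic easy to follow.

```python
from collections import defaultdict


def roc_auc(labels, scores):
    """ROC-AUC computed as the probability that a randomly chosen positive
    example receives a higher score than a randomly chosen negative one
    (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def auc_by_group(labels, scores, groups):
    """Split examples by group label and compute ROC-AUC within each group."""
    buckets = defaultdict(lambda: ([], []))
    for y, s, g in zip(labels, scores, groups):
        buckets[g][0].append(y)
        buckets[g][1].append(s)
    return {g: roc_auc(ys, ss) for g, (ys, ss) in buckets.items()}


# Toy data, invented for illustration only (1 = malignant, 0 = benign).
labels = [1, 0, 1, 0, 1, 0, 1, 0]
scores = [0.9, 0.2, 0.8, 0.3, 0.6, 0.7, 0.4, 0.5]
groups = ["light"] * 4 + ["dark"] * 4  # hypothetical group labels

per_group = auc_by_group(labels, scores, groups)
gap = per_group["light"] - per_group["dark"]
print(per_group, gap)
```

The pairwise formulation is quadratic in the number of examples, which is acceptable for a dataset of a few hundred images; a rank-based implementation would be the usual choice for larger sets.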

updated: Tue Mar 15 2022 20:33:23 GMT+0000 (UTC)

published: Tue Mar 15 2022 20:33:23 GMT+0000 (UTC)

arXiv
