Algorithmic encoding of protected characteristics in image-based models for disease detection

Ben Glocker; Charles Jones; Melanie Bernhardt; Stefan Winzeck

It has rightly been emphasized that the use of AI for clinical decision making could amplify health disparities. An algorithm may encode protected characteristics and then use this information for making predictions due to undesirable correlations in the (historical) training data. It remains unclear how we can establish whether such information is actually used. Besides the scarcity of data from underserved populations, very little is known about how dataset biases manifest in predictive models and how this may result in disparate performance. This article aims to shed some light on these issues by exploring new methodology for subgroup analysis in image-based disease detection models. We utilize two publicly available chest X-ray datasets, CheXpert and MIMIC-CXR, to study performance disparities across race and biological sex in deep learning models. We explore test-set resampling, transfer learning, multitask learning, and model inspection to assess the relationship between the encoding of protected characteristics and disease detection performance across subgroups. We confirm subgroup disparities in the form of shifted true and false positive rates, which are partially removed after correcting for population and prevalence shifts in the test sets. We further find a previously used transfer learning method to be insufficient for establishing whether specific patient information is used for making predictions. The proposed combination of test-set resampling, multitask learning, and model inspection reveals valuable new insights about the way protected characteristics are encoded in the feature representations of deep neural networks.
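To make the subgroup analysis described in the abstract more concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes binary disease labels, model scores, and one subgroup label per test image. The function names `subgroup_rates` and `balanced_resample`, the fixed decision threshold, and the choice to equalize both subgroup size and prevalence are illustrative assumptions; the paper itself may define the resampling differently.

```python
import numpy as np


def subgroup_rates(y_true, y_score, group, threshold=0.5):
    """Return {subgroup: (TPR, FPR)} at a fixed decision threshold.

    Hypothetical helper for illustration; a subgroup with no positive
    (or no negative) cases gets NaN for the corresponding rate.
    """
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_score) >= threshold
    group = np.asarray(group)
    rates = {}
    for g in np.unique(group):
        pos = (group == g) & y_true    # diseased cases in this subgroup
        neg = (group == g) & ~y_true   # non-diseased cases in this subgroup
        tpr = y_pred[pos].mean() if pos.any() else float("nan")
        fpr = y_pred[neg].mean() if neg.any() else float("nan")
        rates[g.item()] = (float(tpr), float(fpr))
    return rates


def balanced_resample(y_true, group, n_per_group, prevalence, seed=0):
    """Return indices of a resampled test set (sampling with replacement).

    Every subgroup contributes n_per_group cases with the same disease
    prevalence, which removes population and prevalence differences
    between subgroups. Assumes each subgroup has at least one positive
    and one negative case; otherwise rng.choice raises ValueError.
    """
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true).astype(bool)
    group = np.asarray(group)
    n_pos = int(round(n_per_group * prevalence))
    n_neg = n_per_group - n_pos
    parts = []
    for g in np.unique(group):
        pos = np.flatnonzero((group == g) & y_true)
        neg = np.flatnonzero((group == g) & ~y_true)
        parts.append(rng.choice(pos, size=n_pos, replace=True))
        parts.append(rng.choice(neg, size=n_neg, replace=True))
    return np.concatenate(parts)
```

Under these assumptions, one would compute `subgroup_rates` on the original test set and again on the indices returned by `balanced_resample`, then compare the two. Disparities that shrink after resampling can be attributed in part to differences in subgroup size and disease prevalence rather than to the model alone.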

updated: Thu Jul 21 2022 15:33:21 GMT+0000 (UTC)

published: Wed Oct 27 2021 20:30:57 GMT+0000 (UTC)

arXiv

