Surgical Aggregation: A Federated Learning Framework for Harmonizing Distributed Datasets with Diverse Tasks

Pranav Kulkarni; Adway Kanhere; Paul H. Yi; Vishwa S. Parekh

Many large-scale chest x-ray datasets have been curated for detecting abnormalities with deep learning, with the potential to provide substantial benefits across many clinical applications. However, each dataset covers only a subset of the disease labels that could be present, which limits its clinical utility. Furthermore, the distributed nature of these datasets, along with data sharing regulations, makes it difficult to pool them into a complete representation of disease labels. To that end, we propose surgical aggregation, a federated learning framework for aggregating and harmonizing knowledge from distributed datasets with different disease labels into a 'global' deep learning model. We used surgical aggregation to harmonize the NIH (14 labels) and CheXpert (13 labels) datasets into a global model able to predict all 20 unique disease labels, and compared it with 'baseline' models trained individually on each dataset. On held-out test sets from both datasets, the global model achieved an average AUROC of 0.75 and 0.74 respectively, compared with baseline average AUROCs of 0.81 and 0.71. On the MIMIC external test set, the global model generalized better, with an average AUROC of 0.80 compared with 0.74 and 0.76 respectively for the baseline models. Our results show that surgical aggregation has the potential to develop clinically useful deep learning models by aggregating knowledge from distributed datasets with diverse tasks, a step towards bridging the gap from bench to bedside.
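The label counts above imply that seven labels are shared between the two datasets (14 + 13 - 20 = 7). The abstract does not spell out the aggregation rule, so the following is only a minimal sketch of one plausible reading, not the authors' implementation: each client trains a classification head over its own labels, and the server builds a head over the union of labels by averaging each label's weights across only the clients that have that label. The function `aggregate_heads`, the site names, the toy label sets, and the constant weight values are all hypothetical and chosen purely for illustration.

```python
import numpy as np


def aggregate_heads(global_labels, client_labels, client_heads):
    """Build a head over the union of labels (illustrative assumption).

    Each row of a client head holds the weights for one of that client's
    labels. A global row is the mean of the matching rows from every
    client that trains that label.
    """
    dim = next(iter(client_heads.values())).shape[1]
    merged = np.zeros((len(global_labels), dim))
    for i, label in enumerate(global_labels):
        rows = [
            client_heads[client][labels.index(label)]
            for client, labels in client_labels.items()
            if label in labels
        ]
        merged[i] = np.mean(rows, axis=0)
    return merged


# Toy label sets (hypothetical): two labels are shared, one is unique per site.
client_labels = {
    "site_a": ["cardiomegaly", "edema", "hernia"],
    "site_b": ["cardiomegaly", "edema", "fracture"],
}
global_labels = sorted(set().union(*client_labels.values()))

# Constant-valued heads make the averaging easy to inspect by eye.
client_heads = {
    "site_a": np.full((3, 4), 1.0),
    "site_b": np.full((3, 4), 3.0),
}

global_head = aggregate_heads(global_labels, client_labels, client_heads)
print(global_labels)
print(global_head)
```

In this toy setup the shared labels receive the average of both sites' weights, while a label held by a single site keeps that site's weights unchanged. A full federated round would also aggregate the shared feature extractor, which this sketch leaves out.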

updated: Fri Feb 17 2023 14:11:18 GMT+0000 (UTC)

published: Tue Jan 17 2023 03:53:29 GMT+0000 (UTC)

arXiv
