Benchmarking Bayesian Deep Learning on Diabetic Retinopathy Detection Tasks

Neil Band; Tim G. J. Rudner; Qixuan Feng; Angelos Filos; Zachary Nado; Michael W. Dusenberry; Ghassen Jerfel; Dustin Tran; Yarin Gal

Bayesian deep learning seeks to equip deep neural networks with the ability to precisely quantify their predictive uncertainty, and has promised to make deep learning more reliable for safety-critical real-world applications. Yet existing Bayesian deep learning methods fall short of this promise; new methods continue to be evaluated on unrealistic test beds that do not reflect the complexities of the downstream real-world tasks that would benefit most from reliable uncertainty quantification. We propose the RETINA Benchmark, a set of real-world tasks that accurately reflect such complexities and are designed to assess the reliability of predictive models in safety-critical scenarios. Specifically, we curate two publicly available datasets of high-resolution human retina images exhibiting varying degrees of diabetic retinopathy, a medical condition that can lead to blindness, and use them to design a suite of automated diagnosis tasks that require reliable predictive uncertainty quantification. We use these tasks to benchmark well-established and state-of-the-art Bayesian deep learning methods on task-specific evaluation metrics. We provide an easy-to-use codebase for fast benchmarking that follows reproducibility and software design principles. We also provide implementations of all methods included in the benchmark, together with results that draw on 100 TPU days and 20 GPU days of compute, cover 400 hyperparameter configurations, and are evaluated on at least 6 random seeds each.

updated: Wed Nov 23 2022 05:44:42 GMT+0000 (UTC)

published: Wed Nov 23 2022 05:44:42 GMT+0000 (UTC)

arXiv
