Large-scale data extraction from the UNOS organ donor documents

Marek Rychlik; Bekir Tanriover; Yan Han

The scope of our study is all UNOS data on organ donors in the USA since 2008. In the past, this data could not be analyzed at scale because it was captured in PDF documents known as "Attachments", with every donor represented by dozens of PDF documents in heterogeneous formats. To make the data analyzable, one needs to convert the content of these PDFs to an analyzable data format, such as a standard SQL database. In this paper we focus on the 2022 UNOS data, comprising ≈400,000 PDF documents spanning millions of pages. The totality of UNOS data covers 15 years (2008--2022), and our results will be quickly extended to the entire data set. Our method captures a portion of the data in DCD flowsheets, kidney perfusion data, and data captured during the patient's hospital stay (e.g. vital signs and ventilator settings). The current paper assumes that the reader is familiar with the content of the UNOS data. An overview of the types of data and the challenges they present is the subject of another paper. Here we focus on demonstrating that building a comprehensive, analyzable database from UNOS documents is an attainable goal, and we provide an overview of our methodology. Even in this preliminary phase, the project has produced datasets far larger than those previously available.

updated: Tue Jan 02 2024 23:39:10 GMT+0000 (UTC)

published: Wed Aug 30 2023 04:29:48 GMT+0000 (UTC)

arXiv
