The scope of our study is all UNOS data on organ donors in the USA since 2008. In the past, this data could not be analyzed at scale because it was captured in PDF documents known as "Attachments", with every donor represented by dozens of PDF documents in heterogeneous formats. To make the data analyzable, one needs to convert the content of these PDFs to an analyzable data format, such as a standard SQL database. In this paper we focus on the 2022 UNOS data, comprising $\approx 400{,}000$ PDF documents spanning millions of pages. The totality of UNOS data covers 15 years (2008--2022), and our results will be quickly extended to the entire dataset. Our method captures a portion of the data in DCD flowsheets, kidney perfusion data, and data captured during the patient's hospital stay (e.g., vital signs and ventilator settings). The current paper assumes that the reader is familiar with the content of the UNOS data. An overview of the types of data and the challenges they present is the subject of another paper. Here we focus on demonstrating that building a comprehensive, analyzable database from UNOS documents is attainable, and we provide an overview of our methodology. Even in this preliminary phase, the project has produced datasets far larger than those previously available.