The rapid digitization of real-world data offers an unprecedented opportunity for optimizing healthcare delivery and accelerating biomedical discovery. In practice, however, such data is most abundantly available in unstructured forms, such as clinical notes in electronic medical records (EMRs), and is generally plagued by confounders. In this paper, we present TRIALSCOPE, a unifying framework for distilling real-world evidence from population-level observational data. TRIALSCOPE leverages biomedical language models to structure clinical text at scale, employs advanced probabilistic modeling for denoising and imputation, and incorporates state-of-the-art causal inference techniques to combat common confounders. Using clinical trial specifications as a generic representation, TRIALSCOPE provides a turn-key solution for generating and reasoning with clinical hypotheses using observational data. In extensive experiments and analyses on a large-scale real-world dataset with over one million cancer patients from a large US healthcare network, we show that TRIALSCOPE can produce high-quality structuring of real-world data and generate results comparable to those of marquee cancer trials. In addition to facilitating in silico clinical trial design and optimization, TRIALSCOPE may be used to empower synthetic controls, pragmatic trials, and post-market surveillance, as well as to support fine-grained patient-like-me reasoning in precision diagnosis and treatment.