Data quality (DQ) and the transparency of secondary data are critical factors that delay the adoption of clinical AI models and affect clinician trust in them. Many DQ studies fail to clarify where along the data lifecycle quality checks occur, leading to uncertainty about data provenance and fitness for reuse. This study develops a framework for transparent reporting of DQ assessments across the clinical electronic health record (EHR) data lifecycle. The reporting framework was developed through iterative analysis to identify the actors and phases of the clinical data lifecycle. It distinguishes between data-generating and data-receiving organizations so that users can map DQ parameters to stages of the data lifecycle, and it defines 5 key lifecycle phases and multiple actors. When applied to a real-world dataset, the framework demonstrated applicability in revealing where DQ issues may originate. The framework provides a structured approach for reporting DQ assessments, which can enhance transparency regarding data fitness for reuse and thereby support reliable clinical research, AI model development, and internal organizational governance. This work provides practical guidance for researchers seeking to understand data provenance and for organizations seeking to target DQ improvement efforts across the data lifecycle.