Generative Artificial Intelligence (AI) can be used to automatically generate medical reports from transcripts of medical consultations, with the aim of reducing the administrative burden on healthcare professionals. The accuracy of the generated reports must be established to ensure that they are correct and useful. Several metrics exist for measuring the accuracy of AI-generated reports, but little work has been done on applying these metrics to medical reporting. We performed a comparative experiment with 10 accuracy metrics on AI-generated medical reports, evaluated against the corresponding General Practitioner (GP) medical reports for otitis consultations. The numbers of missing, incorrect, and additional statements in the generated reports were correlated with the metric scores. In addition, we introduce and define a Composite Accuracy Score, which produces a single score for comparing the metrics within the field of automated medical reporting. Based on the correlation study and the Composite Accuracy Score, the findings show that ROUGE-L and Word Mover's Distance are the preferred metrics, which is not in line with previous work. These findings help determine the accuracy of an AI-generated medical report, which aids the development of systems that generate medical reports for GPs to reduce their administrative burden.
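To make the correlation step concrete, the following is a minimal sketch of how a ROUGE-L score could be computed for a generated report against a GP report, and how such scores could then be correlated with counted errors. It is an illustration under stated assumptions, not the study's implementation: the whitespace tokenization, the balanced F1 form of ROUGE-L, the choice of Pearson correlation, and all function names are assumptions introduced here. The Composite Accuracy Score is not sketched because its definition is not given in this abstract.

```python
from math import sqrt


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # Extend the diagonal on a match, otherwise carry the best neighbour.
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L as a balanced F1 over LCS precision and recall.

    Assumption: lowercased whitespace tokens and equal weighting of
    precision and recall; the study may have used a different variant.
    """
    ref, cand = reference.lower().split(), candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences.

    Assumption: Pearson is used only for illustration; the abstract does
    not state which correlation coefficient the study applied.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)
```

In use, one would compute `rouge_l_f1(gp_report, generated_report)` for each consultation and pass the resulting scores, together with the per-report counts of missing, incorrect, or additional statements, to `pearson`. For a similarity metric, a strongly negative coefficient would indicate that the score drops as errors increase.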