Tabular data is often hidden in text, particularly in medical diagnostic reports. Traditional machine learning (ML) models designed to work with tabular data cannot effectively process information in this form. On the other hand, large language models (LLMs), which excel at textual tasks, are probably not the best tool for modeling tabular data. We therefore propose TEMED-LLM, a novel, simple, and effective methodology for extracting structured tabular data from textual medical reports. Drawing on the reasoning capabilities of LLMs, TEMED-LLM goes beyond traditional extraction techniques and accurately infers tabular features even when their names are not explicitly mentioned in the text. This is achieved by combining domain-specific reasoning guidelines with a proposed feedback loop for data validation and reasoning correction. By applying interpretable ML models, such as decision trees and logistic regression, to the extracted and validated data, we obtain end-to-end interpretable predictions. We demonstrate that our approach significantly outperforms state-of-the-art text classification models in medical diagnostics. Given its predictive performance, simplicity, and interpretability, TEMED-LLM underscores the potential of leveraging LLMs to improve the performance and trustworthiness of ML models in medical applications.
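The validation and correction feedback loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the schema, the feature names (`age`, `systolic_bp`), the value ranges, the `llm` callable signature, and the `fake_llm` stub are all hypothetical and chosen only for illustration. The idea shown is that an extracted record is checked against simple type and range rules, and any violations are passed back to the model as feedback for another attempt.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical schema (an assumption, not from the paper):
# feature name -> (expected type, minimum, maximum)
SCHEMA = {
    "age": (int, 0, 120),
    "systolic_bp": (int, 50, 260),
}

Record = Dict[str, object]


def validate(record: Record) -> List[str]:
    """Return a list of human-readable validation errors (empty if valid)."""
    errors = []
    for name, (typ, lo, hi) in SCHEMA.items():
        value = record.get(name)
        # bool is a subclass of int in Python, so reject it explicitly.
        if not isinstance(value, typ) or isinstance(value, bool):
            errors.append(f"{name}: expected {typ.__name__}, got {value!r}")
        elif not lo <= value <= hi:
            errors.append(f"{name}: {value} outside [{lo}, {hi}]")
    return errors


def extract_with_feedback(
    report: str,
    llm: Callable[[str, List[str]], Record],
    max_rounds: int = 3,
) -> Optional[Record]:
    """Ask the model for a record; on validation errors, retry with feedback."""
    feedback: List[str] = []
    for _ in range(max_rounds):
        record = llm(report, feedback)
        feedback = validate(record)
        if not feedback:
            return record
    return None  # no valid record within the allowed number of rounds


def fake_llm(report: str, feedback: List[str]) -> Record:
    """Stand-in for a real LLM call: wrong at first, corrected after feedback."""
    if feedback:
        return {"age": 45, "systolic_bp": 150}
    return {"age": 450, "systolic_bp": 150}


result = extract_with_feedback("45-year-old patient, BP 150/95.", fake_llm)
```

In a real pipeline the stub would be replaced by a prompt that includes the report, the reasoning guidelines, and the accumulated validation errors; the validated records could then be collected into a table and passed to an interpretable model such as a decision tree.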