Large Language Models (LLMs) offer significant potential for clinical symptom extraction, but their deployment in healthcare settings is constrained by privacy concerns, computational limitations, and operational costs. This study investigates the optimization of compact LLMs for cancer toxicity symptom extraction using a novel iterative refinement approach. We employ a student-teacher architecture, with Zephyr-7b-beta and Phi3-mini-128 as student models and GPT-4o as the teacher, to dynamically select among prompt refinement, Retrieval-Augmented Generation (RAG), and fine-tuning strategies. Experiments on 294 clinical notes covering 12 post-radiotherapy toxicity symptoms demonstrate the effectiveness of this approach. The RAG method proved most efficient, improving average accuracy during refinement from 0.32 to 0.73 for Zephyr-7b-beta and from 0.40 to 0.87 for Phi3-mini-128. On the test set, both models showed an increase of approximately 0.20 in accuracy across symptoms. Notably, this improvement was achieved at a cost 45 times lower than GPT-4o for Zephyr and 79 times lower for Phi-3. These results highlight the potential of iterative refinement techniques for enhancing the capabilities of compact LLMs in clinical applications, offering a balance between performance, cost-effectiveness, and privacy preservation in healthcare settings.
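To make the student-teacher loop concrete, the sketch below shows one plausible shape of the iterative refinement cycle described above: the student model extracts symptoms, its errors are evaluated, and the teacher chooses the next refinement action (prompt refinement, RAG, or fine-tuning). This is a minimal illustrative sketch under assumed interfaces; the function names, prompts, thresholds, and stub logic (`student_extract`, `teacher_select_strategy`, `apply_strategy`, etc.) are hypothetical placeholders, not the authors' implementation.

```python
# Conceptual sketch of the iterative student-teacher refinement loop.
# All names, prompts, and thresholds are illustrative assumptions.
from dataclasses import dataclass, field


STRATEGIES = ("prompt_refinement", "rag", "fine_tuning")  # candidate refinement actions


@dataclass
class RefinementState:
    prompt: str                                              # current extraction prompt
    retrieved_examples: list = field(default_factory=list)   # RAG context snippets
    accuracy: float = 0.0                                     # student accuracy on refinement set


def student_extract(note: str, state: RefinementState) -> dict:
    """Placeholder: the compact student LLM (e.g. Zephyr-7b-beta) labels each of the
    12 post-radiotherapy toxicity symptoms in the note as present/absent."""
    return {symptom_id: "absent" for symptom_id in range(12)}  # stub output


def evaluate(predictions: list, labels: list) -> float:
    """Placeholder: average per-symptom accuracy against reference labels."""
    return 0.0  # stub score


def teacher_select_strategy(state: RefinementState, errors: list) -> str:
    """Placeholder: the teacher model (GPT-4o) inspects the student's errors and
    picks the next refinement action from STRATEGIES."""
    return "rag"  # in the study, RAG proved the most efficient choice


def apply_strategy(strategy: str, state: RefinementState, errors: list) -> RefinementState:
    """Placeholder: update the student's configuration for the chosen strategy."""
    if strategy == "prompt_refinement":
        state.prompt += "\n# teacher-suggested clarification"
    elif strategy == "rag":
        state.retrieved_examples.append("similar annotated note snippet")
    elif strategy == "fine_tuning":
        pass  # would launch a fine-tuning run on teacher-corrected examples
    return state


def refine(notes: list, labels: list, target: float = 0.85, max_rounds: int = 10) -> RefinementState:
    """Iterate extraction -> evaluation -> teacher-guided refinement until the
    accuracy target or the round limit is reached."""
    state = RefinementState(prompt="Extract the 12 toxicity symptoms from the note.")
    for _ in range(max_rounds):
        preds = [student_extract(n, state) for n in notes]
        state.accuracy = evaluate(preds, labels)
        if state.accuracy >= target:
            break
        errors = [(n, p, l) for n, p, l in zip(notes, preds, labels) if p != l]
        strategy = teacher_select_strategy(state, errors)
        state = apply_strategy(strategy, state, errors)
    return state
```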