Traditional Chinese Medicine (TCM) presents a rich and structurally distinctive knowledge system that challenges conventional applications of large language models (LLMs). Although previous TCM-specific LLMs have made progress through supervised fine-tuning, they often face limitations in alignment, data quality, and evaluation consistency. In this study, we introduce Ladder-base, the first TCM-focused LLM trained with Group Relative Policy Optimization (GRPO), a reinforcement learning method that improves reasoning and factual consistency by optimizing response selection through intra-group comparisons. Ladder-base is built upon the Qwen2.5-7B-Instruct foundation model and trained exclusively on the textual subset of the TCM-Ladder benchmark, with 80 percent of the data used for training and the remaining 20 percent split evenly between validation and test sets. Under standardized evaluation, Ladder-base demonstrates superior performance across multiple reasoning metrics compared with both state-of-the-art general-purpose LLMs (GPT-4, Gemini 2.5, Claude 3, and Qwen3) and domain-specific TCM models (BenTsao, HuatuoGPT2, and Zhongjing). These findings suggest that GRPO provides an effective and efficient strategy for aligning LLMs with expert-level reasoning in traditional medical domains and supports the development of trustworthy and clinically grounded TCM artificial intelligence systems.
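For readers unfamiliar with GRPO, the minimal sketch below illustrates the group-relative advantage computation at the heart of the method: several candidate responses to the same prompt are scored, and each response's advantage is its reward standardized against the group mean and standard deviation. This is an illustrative approximation under general GRPO assumptions, not the training code used for Ladder-base; the function and example rewards are hypothetical.

```python
import numpy as np

def group_relative_advantages(rewards):
    """Compute GRPO-style advantages for one group of sampled responses.

    rewards: scalar rewards, one per candidate response to the same prompt.
    Each response is compared only against its siblings from the same
    prompt, so no separate value/critic model is required.
    """
    r = np.asarray(rewards, dtype=np.float64)
    std = r.std()
    if std < 1e-8:  # all responses scored identically: no learning signal
        return np.zeros_like(r)
    return (r - r.mean()) / std

# Hypothetical example: four sampled answers to one TCM exam question,
# scored by a reward function (e.g., 1.0 if the final answer is correct).
print(group_relative_advantages([1.0, 0.0, 0.5, 1.0]))
```

In practice these per-response advantages weight a clipped policy-gradient objective, so responses that outperform their group are reinforced and weaker ones are suppressed.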