Automated International Classification of Diseases (ICD) coding is a core medical-coding task for billing, epidemiology, and clinical decision support. Generative large language models (LLMs) are often reported as weak medical coders, but this finding comes mainly from inference-time settings such as prompting, retrieval, reranking, or tool use, leaving the role of task-specific post-training underexplored. We present a controlled empirical study of post-training for generative ICD coding, comparing discriminative baselines with LLM coders across prompting, supervised fine-tuning (SFT), and reinforcement learning (RL) under a common protocol and metric set. To our knowledge, this is the first study to evaluate RL-based post-training of generative LLMs for ICD coding. We further introduce PHI, a diagnostic curriculum that extends Group Relative Policy Optimization (GRPO) to refine missed-code cases. Our results show that prompting-only evaluation substantially underestimates the potential of LLMs for ICD coding. SFT provides the main capability jump, GRPO further improves code-set prediction beyond SFT, and PHI provides targeted gains on macro-level performance. These findings suggest that the main bottleneck is not the generative formulation alone, but how the model is adapted and optimized for full-taxonomy recall. We release our code, data splits, and checkpoints at https://github.com/AlexandreWANG915/LLM4ICD.
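The abstract distinguishes code-set prediction from macro-level performance without spelling out the metrics. As a minimal sketch, assuming these correspond to the micro- and macro-averaged F1 commonly reported for multi-label ICD coding, the toy example below shows why missed rare codes hurt the macro score far more than the micro score. The function name `micro_macro_f1`, the example codes, and the choice to average over labels seen in the gold or predicted sets (rather than the full taxonomy) are illustrative assumptions, not the paper's evaluation code.

```python
from collections import Counter


def micro_macro_f1(gold, pred):
    """Compute micro- and macro-averaged F1 over per-document code sets.

    gold, pred: lists of sets of code strings, one set per document.
    Macro F1 is averaged over labels that occur in gold or pred
    (an assumption; some protocols average over the full label space).
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        for code in g & p:
            tp[code] += 1  # correctly predicted code
        for code in p - g:
            fp[code] += 1  # predicted but not in the gold set
        for code in g - p:
            fn[code] += 1  # gold code the model missed

    def f1(t, f_pos, f_neg):
        denom = 2 * t + f_pos + f_neg
        return 2 * t / denom if denom else 0.0

    labels = set(tp) | set(fp) | set(fn)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels) if labels else 0.0
    return micro, macro


# Toy data: the model always predicts the frequent code and misses the rare ones.
gold = [{"I10", "E11.9"}, {"I10"}, {"I10", "N18.3"}]
pred = [{"I10"}, {"I10"}, {"I10"}]
micro, macro = micro_macro_f1(gold, pred)
print(f"micro-F1={micro:.3f}  macro-F1={macro:.3f}")
```

In this toy case the frequent code is always correct, so micro-F1 stays at 0.75, while the two missed rare codes each contribute an F1 of zero and pull macro-F1 down to about 0.333. This is the sense in which refining missed-code cases would be expected to show up mainly in macro-averaged scores.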