Reinforcement Learning (RL) offers a powerful framework for optimizing dynamic treatment regimes (DTRs). However, clinical RL is fundamentally bottlenecked by reward engineering: the challenge of defining signals that safely and effectively guide policy learning in complex offline settings with sparse feedback. Existing approaches often rely on manual heuristics that fail to generalize across diverse pathologies. To address this, we propose an automated pipeline leveraging Large Language Models (LLMs) for offline reward design and verification. We formulate the reward function using potential functions composed of three core components: survival, confidence, and competence. We further introduce quantitative metrics to rigorously evaluate and select the optimal reward structure prior to deployment. By integrating LLM-driven domain knowledge, our framework automates the design of reward functions for specific diseases while significantly enhancing the performance of the resulting policies.
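For concreteness, the minimal sketch below illustrates one way such a composite potential could enter a shaped reward. The component definitions, the weights, and the state keys (`survival_prob`, `confidence`, `competence`) are illustrative assumptions rather than the paper's actual formulations; only the shaping form r'(s, a, s') = r(s, a, s') + γΦ(s') − Φ(s) follows the standard potential-based construction of Ng et al. (1999), which is known to leave the optimal policy unchanged.

```python
# Hedged sketch: potential-based reward shaping with three component
# potentials (survival, confidence, competence). All component
# definitions, weights, and state keys below are illustrative
# placeholders, not the paper's own.

GAMMA = 0.99  # discount factor (assumed)

def survival_potential(state):
    # Placeholder: e.g., a predicted survival score stored in the state.
    return state["survival_prob"]

def confidence_potential(state):
    # Placeholder: e.g., how in-distribution the state is for the
    # offline dataset, as judged by some density or uncertainty model.
    return state["confidence"]

def competence_potential(state):
    # Placeholder: e.g., how well the learned policy is expected to
    # act from this state.
    return state["competence"]

def potential(state, weights=(1.0, 1.0, 1.0)):
    """Composite potential Phi(s) as a weighted sum of the three
    components; the equal weights are an assumption."""
    w_surv, w_conf, w_comp = weights
    return (w_surv * survival_potential(state)
            + w_conf * confidence_potential(state)
            + w_comp * competence_potential(state))

def shaped_reward(r_env, state, next_state, gamma=GAMMA):
    """Potential-based shaping (Ng et al., 1999):
    r'(s, a, s') = r(s, a, s') + gamma * Phi(s') - Phi(s)."""
    return r_env + gamma * potential(next_state) - potential(state)
```

In this construction, an LLM-driven designer would only need to propose the component potentials and their weights; the shaping form itself guarantees that the augmented reward cannot alter which policy is optimal, which is what makes pre-deployment comparison of candidate reward structures meaningful.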