Optimally sequencing experimental assays in drug discovery is a high-stakes planning problem under severe uncertainty and resource constraints. A primary obstacle for standard reinforcement learning (RL) is the absence of an explicit environment simulator or transition data $(s, a, s')$; planning must rely solely on a static database of historical outcomes. We introduce the Implicit Bayesian Markov Decision Process (IBMDP), a model-based RL framework designed for such simulator-free settings. IBMDP constructs a case-guided implicit model of transition dynamics by forming a nonparametric belief distribution from similar historical outcomes. This model supports Bayesian belief updating as evidence accumulates, and an ensemble Monte Carlo tree search (MCTS) planner uses it to generate stable policies that balance information gain toward desired outcomes with resource efficiency. We validate IBMDP in two settings. On a real-world central nervous system (CNS) drug discovery task, IBMDP reduced resource consumption by up to 92\% compared to established heuristics while maintaining decision confidence. To rigorously assess decision quality, we also benchmarked IBMDP in a synthetic environment with a computable optimal policy. There, our framework achieves significantly higher alignment with the optimal policy than a deterministic value iteration alternative that uses the same similarity-based model, demonstrating the advantage of the ensemble planner. IBMDP offers a practical solution for sequential experimental design in data-rich but simulator-poor domains.
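To make the idea of a similarity-based implicit transition model more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each historical case is a dictionary of assay results, that similarity is measured with a Gaussian kernel over the assays observed so far, and that the outcome of a candidate assay is drawn from a historical case sampled in proportion to its weight. The function names (`similarity_weights`, `sample_outcome`), the `bandwidth` parameter, and the kernel choice are all hypothetical and introduced here only for illustration.

```python
import math
import random


def similarity_weights(observed, history, bandwidth=1.0):
    """Return normalized similarity weights over historical cases.

    observed : dict mapping assay name -> value measured so far (may be empty)
    history  : list of dicts, each mapping assay name -> historical value
    bandwidth: Gaussian kernel width (hypothetical tuning parameter)
    """
    weights = []
    for case in history:
        # Squared distance over the assays observed so far only.
        d2 = sum((case[name] - value) ** 2 for name, value in observed.items())
        weights.append(math.exp(-d2 / (2.0 * bandwidth ** 2)))
    total = sum(weights)
    if total == 0.0:
        # Every kernel value underflowed: fall back to a uniform belief.
        return [1.0 / len(history)] * len(history)
    return [w / total for w in weights]


def sample_outcome(observed, assay, history, rng, bandwidth=1.0):
    """Sample a plausible result for `assay` from similar historical cases."""
    weights = similarity_weights(observed, history, bandwidth)
    case = rng.choices(history, weights=weights, k=1)[0]
    return case[assay]


if __name__ == "__main__":
    # Toy database with two assays per compound (values are made up).
    history = [
        {"assay_a": 0.0, "assay_b": 1.0},
        {"assay_a": 0.1, "assay_b": 1.2},
        {"assay_a": 5.0, "assay_b": 9.0},
    ]
    rng = random.Random(0)
    observed = {"assay_a": 0.0}  # evidence gathered so far
    print(similarity_weights(observed, history))
    print(sample_outcome(observed, "assay_b", history, rng))
```

With no observations the weights are uniform, which plays the role of a prior over historical cases. Each new assay result reweights the cases, which is one simple way to realize the belief updating described above. A planner such as MCTS could call `sample_outcome` in place of a simulator when expanding nodes.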