Sequential Bayesian experimental design is often formulated as a fixed-horizon policy optimization problem, in which the number of experiments is specified before data collection begins. In practical campaigns, however, additional measurements may provide diminishing information relative to their cost, making termination an integral part of experimental design. Common threshold-based stopping rules are easy to implement but myopic, because they compare the current state with a fixed criterion rather than the expected value of future experiments. This work develops a Bayesian optimal stopping framework for sequential experimental design by treating design and stopping as coupled decisions in a finite-horizon sequential decision problem. We prove that, for any fixed design policy, the optimal stopping rule terminates when the immediate terminal reward is no smaller than the expected continuation value. We then derive a policy-gradient method for learning continuous design policies with value-based stopping. The resulting optimization is challenging because the design policy, continuation value, and stopping boundary are mutually dependent, and naïve training can become trapped in early-stopping local optima. To address this difficulty, we introduce a curriculum strategy that gradually transitions from forced continuation to adaptive stopping during training. Numerical studies on a linear-Gaussian benchmark, a nonlinear test case, and a contaminant source detection problem show that the proposed approach learns stable, resource-aware design-stopping policies, with the largest gains in settings with strong sequential dependence.
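As an illustration of the stopping rule stated above (terminate when the immediate terminal reward is no smaller than the continuation value), the following is a minimal sketch and not the method developed in this work. It assumes a scalar linear-Gaussian model with a fixed unit design, a terminal reward equal to the information gain minus a linear experiment cost, and hypothetical parameter values (`prior_var`, `noise_var`, `cost`, `horizon`). Under these assumptions the posterior variance does not depend on the observed data, so the expected continuation value reduces to a deterministic quantity and backward induction can be carried out exactly. In the general setting the continuation value is an expectation over future observations and must be estimated.

```python
import math


def terminal_reward(t, prior_var=1.0, noise_var=1.0, cost=0.1):
    """Information gain after t unit-design experiments minus total cost.

    Assumed model: y = theta + noise, theta ~ N(0, prior_var),
    noise ~ N(0, noise_var). All parameter values are illustrative.
    """
    post_var = 1.0 / (1.0 / prior_var + t / noise_var)
    return 0.5 * math.log(prior_var / post_var) - cost * t


def optimal_stopping(horizon=10, **reward_kwargs):
    """Backward induction for a finite-horizon stopping problem.

    values[t] is the value of reaching stage t (t experiments done);
    stop[t] is True when stopping at stage t is optimal.
    """
    values = [0.0] * (horizon + 1)
    stop = [False] * (horizon + 1)

    # At the horizon there is no continuation, so stopping is forced.
    values[horizon] = terminal_reward(horizon, **reward_kwargs)
    stop[horizon] = True

    for t in range(horizon - 1, -1, -1):
        immediate = terminal_reward(t, **reward_kwargs)
        # Deterministic here; in general this is an expectation
        # over the next observation under the design policy.
        continuation = values[t + 1]
        stop[t] = immediate >= continuation
        values[t] = max(immediate, continuation)

    return values, stop


values, stop = optimal_stopping()
stop_time = stop.index(True)  # first stage at which stopping is optimal
print("stop after", stop_time, "experiments; value", round(values[0], 4))
```

With these illustrative parameters, the rule continues while another measurement is worth more than its cost and stops after four experiments, well before the ten-experiment horizon.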