Reinforcement learning can effectively learn amortised design policies for sequences of experiments. However, current methods rely on contrastive estimators of expected information gain, which require an exponential number of contrastive samples to achieve an unbiased estimate. We propose an alternative lower bound estimator, based on the cross-entropy of the joint model distribution and a flexible proposal distribution. This proposal distribution approximates the true posterior of the model parameters given the experimental history and the design policy. Our estimator requires no contrastive samples, estimates high information gains more accurately, enables learning of superior design policies, and is compatible with implicit probabilistic models. We assess our algorithm's performance on a range of tasks, covering continuous and discrete designs as well as explicit and implicit likelihoods.
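
The abstract does not state the estimator's formula. A standard bound of this kind, assumed here purely for illustration, is the variational posterior (Barber-Agakov) bound I(θ; y) ≥ E_{p(θ, y)}[log q(θ | y)] + H[p(θ)], in which the expectation is the negative cross-entropy between the joint and the proposal q. The sketch below checks this bound on a hypothetical one-step linear-Gaussian model where the information gain is known in closed form; the model, the Gaussian proposal, and the function names are all assumptions and are not taken from the paper.

```python
import numpy as np

def posterior_bound(d, sigma, a, s, n=200_000, seed=0):
    """Monte Carlo estimate of E[log q(theta | y)] + H[p(theta)].

    Assumed toy model: theta ~ N(0, 1), y = d * theta + N(0, sigma^2).
    Assumed proposal:  q(theta | y) = N(a * y, s^2).
    """
    rng = np.random.default_rng(seed)
    theta = rng.standard_normal(n)                   # samples from the prior
    y = d * theta + sigma * rng.standard_normal(n)   # simulated outcomes
    # Gaussian log-density of the proposal at the true parameter values.
    log_q = -0.5 * np.log(2 * np.pi * s**2) - 0.5 * ((theta - a * y) / s) ** 2
    prior_entropy = 0.5 * np.log(2 * np.pi * np.e)   # entropy of N(0, 1)
    return float(log_q.mean() + prior_entropy)

def true_eig(d, sigma):
    """Closed-form mutual information between theta and y for the toy model."""
    return float(0.5 * np.log(1.0 + d**2 / sigma**2))

d, sigma = 3.0, 0.5
# Exact posterior parameters for the toy model: N(a_star * y, s_star^2).
a_star = d / (d**2 + sigma**2)
s_star = float(np.sqrt(sigma**2 / (d**2 + sigma**2)))

tight = posterior_bound(d, sigma, a_star, s_star)        # proposal = posterior
loose = posterior_bound(d, sigma, 0.5 * a_star, s_star)  # mis-specified mean
exact = true_eig(d, sigma)
print(f"exact={exact:.3f} tight={tight:.3f} loose={loose:.3f}")
```

In this toy setting the bound matches the closed-form value up to Monte Carlo error when the proposal equals the exact posterior, and falls below it otherwise. No contrastive samples are drawn, and only simulated (θ, y) pairs are needed, with no likelihood evaluations.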