In many settings in science and industry, such as drug discovery and clinical trials, a central challenge is designing experiments under time and budget constraints. Bayesian Optimal Experimental Design (BOED) is a paradigm for selecting maximally informative designs and has been increasingly applied to such problems. During training, BOED selects inputs according to a pre-determined acquisition criterion that targets informativeness. During testing, the learned model encounters a naturally occurring distribution of test samples. This results in covariate shift: the training and test samples are drawn from different distributions, so the training samples are not representative of the test distribution. Prior work has shown that, in the presence of model misspecification, covariate shift amplifies generalization error. Our first contribution is a mathematical analysis that reveals the key contributors to generalization error under model misspecification. We show that, in addition to covariate shift, generalization error under misspecification results from a phenomenon we term error (de-)amplification, which has not been identified or studied in prior work. We then develop a new acquisition function that mitigates the effects of model misspecification by including terms for representativeness, informativeness, and de-amplification (R-IDeA). Our experimental results demonstrate that the proposed method outperforms methods that target only informativeness, only representativeness, or both.
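The interaction between covariate shift and model misspecification described above can be illustrated with a small toy experiment. The sketch below is not the paper's analysis and does not implement R-IDeA; it is a minimal illustration under assumed choices: a quadratic ground truth `f(x) = x**2`, a test distribution N(0, 1), a "shifted" training design N(2, 0.5²) standing in for a non-representative acquisition, and polynomial least squares as the model. All function names and parameters are hypothetical.

```python
import numpy as np


def fit_and_test_mse(x_train, y_train, x_test, f_test, degree):
    """Fit a polynomial by least squares and return its mean squared
    error against the noiseless test targets."""
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    predictions = np.polyval(coeffs, x_test)
    return float(np.mean((predictions - f_test) ** 2))


def run_demo(seed=0, n_train=500, n_test=100_000, noise_sd=0.1):
    """Compare test error for misspecified (linear) and well-specified
    (quadratic) models under representative and shifted training designs."""
    rng = np.random.default_rng(seed)

    def f(x):
        # Assumed ground truth for this toy example.
        return x ** 2

    # Test inputs follow the "naturally occurring" distribution N(0, 1).
    x_test = rng.normal(0.0, 1.0, n_test)
    f_test = f(x_test)

    designs = {
        # Training inputs drawn from the test distribution.
        "representative": rng.normal(0.0, 1.0, n_train),
        # Training inputs concentrated away from the test distribution.
        "shifted": rng.normal(2.0, 0.5, n_train),
    }

    results = {}
    for design_name, x_train in designs.items():
        y_train = f(x_train) + rng.normal(0.0, noise_sd, n_train)
        for degree, model_name in [(1, "misspecified"), (2, "well_specified")]:
            results[(model_name, design_name)] = fit_and_test_mse(
                x_train, y_train, x_test, f_test, degree
            )
    return results


if __name__ == "__main__":
    for (model_name, design_name), mse in sorted(run_demo().items()):
        print(f"{model_name:>14} | {design_name:>14} | test MSE = {mse:.4f}")
```

In this toy setup, the linear (misspecified) model trained on the shifted design should incur a much larger test error than the same model trained on the representative design, while the quadratic (well-specified) model stays accurate under both designs. This only mirrors the qualitative claim that covariate shift amplifies generalization error under misspecification; it says nothing about the error (de-)amplification term the paper introduces.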