In model-based optimisation (MBO) we are interested in using machine learning to design candidates that maximise some measure of reward with respect to a black box function called the (ground truth) oracle, which is expensive to compute since it involves executing a real world process. In offline MBO we wish to do so without assuming access to such an oracle during training or validation, which makes evaluation non-straightforward. While an approximation to the ground truth oracle can be trained and used in its place during model validation to measure the mean reward over generated candidates, the evaluation is approximate and vulnerable to adversarial examples. Measuring the mean reward of generated candidates under this approximation is one such `validation metric', whereas we are interested in a more fundamental question: which validation metrics correlate the most with the ground truth? Answering it involves proposing validation metrics and quantifying them over many datasets for which the ground truth is known, for instance simulated environments. This is encapsulated in our proposed evaluation framework, which is also designed to measure extrapolation, the ultimate goal behind leveraging generative models for MBO. While our evaluation framework is model agnostic, we specifically evaluate denoising diffusion models due to their state-of-the-art performance, and we derive insights such as a ranking of the most effective validation metrics and a discussion of important hyperparameters.
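The central question, which validation metrics track the ground truth, can be illustrated with a minimal sketch. It assumes that each trained model (or checkpoint) yields one value per validation metric plus one ground-truth reward, and that higher is better for every metric. The metric names and all numbers below are hypothetical toy values, not results, and Pearson correlation is used only for illustration; a rank correlation could be substituted.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def rank_metrics(metric_values, ground_truth_reward):
    """Sort validation metrics by correlation with ground truth, strongest first."""
    scores = {
        name: pearson(values, ground_truth_reward)
        for name, values in metric_values.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data: one entry per trained model, on a dataset where the
# ground-truth oracle is available (e.g. a simulated environment).
ground_truth_reward = [0.10, 0.35, 0.50, 0.80]
metric_values = {
    # Mean reward of generated candidates under the approximate oracle.
    "approx_oracle_mean_reward": [0.20, 0.40, 0.45, 0.90],
    # A second, made-up metric included only for contrast.
    "hypothetical_metric_b": [0.90, 0.10, 0.60, 0.30],
}

ranking = rank_metrics(metric_values, ground_truth_reward)
for name, corr in ranking:
    print(f"{name}: {corr:+.3f}")
```

In practice this comparison would be repeated over many datasets with a known ground truth, and the per-dataset correlations aggregated before ranking the metrics.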