In offline model-based optimisation (MBO) we use machine learning to design candidates that maximise some measure of desirability, as defined by an expensive real-world scoring process. Offline MBO methods approximate this expensive scoring function and use the approximation to evaluate generated designs. However, this evaluation is inexact, because one approximation is being evaluated with another. Instead, we ask: if we did have the real-world scoring function at hand, which cheap-to-compute validation metrics would correlate best with it? Since the real-world scoring function is available for simulated MBO datasets, insights obtained there can be transferred to real-world offline MBO tasks, where the scoring function is expensive to compute. To address this question, we propose a conceptual evaluation framework that is amenable to measuring extrapolation, and apply it to conditional denoising diffusion models. Empirically, we find that two validation metrics, agreement and Fréchet distance, correlate quite well with the ground truth. When there is high variability in conditional generation, feedback is required in the form of an approximated version of the real-world scoring function. Furthermore, we find that generating high-scoring samples may require heavily weighting the generative model in favour of sample quality, potentially at the cost of sample diversity.
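The central question of the abstract, namely which cheap validation metric tracks the real-world score best, can be illustrated with a rank correlation across candidate models. The following is a minimal sketch and not the authors' evaluation code: the function names (`pearson`, `ranks`, `spearman`) are hypothetical, the choice of Spearman correlation is an assumption made for illustration, and the numbers are made-up toy values rather than results from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks of the values; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with values[order[i]].
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


if __name__ == "__main__":
    # Toy values (assumed, not from the paper): one entry per candidate model.
    validation_metric = [0.9, 0.4, 0.7, 0.1]   # cheap metric, higher is better
    ground_truth_score = [2.0, 1.0, 1.5, 0.2]  # expensive real-world score
    print(spearman(validation_metric, ground_truth_score))
```

Under this sketch, a metric where higher is better should correlate positively with the ground truth, while a metric where lower is better (a distance such as Fréchet distance) would be expected to correlate negatively, so the sign has to be interpreted per metric.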