Bayesian optimization (BO) is a central tool for sample-efficient design, and latent-space Bayesian optimization (LSBO) extends it to structured objects such as molecules and proteins. In parallel, tabular foundation models such as TabPFN and TabICL now achieve state-of-the-art regression performance and are increasingly used as BO surrogates. Because their Bayesian behavior is induced by large synthetic pretraining collections, the composition of this pretraining distribution is crucial. LSBO creates a distinctive mismatch: the induced map from latent code to objective value differs markedly from the regression tasks used to train current in-context models. We address this mismatch by complementing the pretraining stage of tabular foundation model surrogates with synthetic optimization tasks defined on the latent space of a molecular VAE. The continued-pretraining objective features a regularizer that anchors the model to the original checkpoint, preserving its broad regression prior while avoiding overspecialization to the adaptation tasks. On held-out molecular optimization benchmarks, the resulting model achieves strong performance, supporting the relevance of LSBO-specific adaptation for in-context surrogates.
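The anchoring idea in the continued-pretraining objective can be illustrated with a small sketch. The abstract does not specify the form of the regularizer, so the squared L2 penalty on the distance to the checkpoint parameters used here is an assumption, as is the toy linear model standing in for the tabular foundation model. The names `anchored_loss`, `anchored_grad`, `continue_pretraining`, and the weight `lam` are hypothetical and chosen only for illustration.

```python
# Minimal sketch, NOT the paper's objective: a task loss plus an assumed
# squared-L2 penalty that pulls parameters toward a frozen checkpoint.
# A toy linear model stands in for the in-context surrogate.

def predict(theta, row):
    """Linear prediction for one input row."""
    return sum(t * x for t, x in zip(theta, row))

def anchored_loss(theta, theta_ref, xs, ys, lam):
    """Mean squared error on the adaptation task plus the anchor penalty."""
    n = len(ys)
    task = sum((predict(theta, r) - y) ** 2 for r, y in zip(xs, ys)) / (2 * n)
    anchor = 0.5 * lam * sum((t - r) ** 2 for t, r in zip(theta, theta_ref))
    return task + anchor

def anchored_grad(theta, theta_ref, xs, ys, lam):
    """Gradient of anchored_loss with respect to theta."""
    n = len(ys)
    grad = [lam * (t - r) for t, r in zip(theta, theta_ref)]
    for row, y in zip(xs, ys):
        residual = predict(theta, row) - y
        for j, x in enumerate(row):
            grad[j] += residual * x / n
    return grad

def continue_pretraining(theta_ref, xs, ys, lam, lr=0.1, steps=500):
    """Plain gradient descent started from the checkpoint parameters."""
    theta = list(theta_ref)
    for _ in range(steps):
        g = anchored_grad(theta, theta_ref, xs, ys, lam)
        theta = [t - lr * gj for t, gj in zip(theta, g)]
    return theta
```

With `lam = 0` the parameters fit the adaptation data freely; increasing `lam` keeps them closer to the checkpoint, which is the trade-off between specialization and preserving the original prior that the abstract describes.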