Evidence-Gated LLM Priors for Multi-Objective Bayesian Optimization

Large language models (LLMs) are increasingly used as heuristic advisors for black-box optimization, yet their suggestions and self-reported confidence are not necessarily calibrated to downstream objective values. The problem is more pronounced in multi-objective Bayesian optimization, where different objectives may require different expert knowledge and an LLM expert can be useful for one objective but misleading for another. We study how to use LLM-generated expert priors in discrete multi-objective Bayesian optimization without blindly trusting them. We propose an objective-wise reputation-market mechanism that treats each expert-objective pair as a falsifiable prior source: expert weights are updated online from observed objective feedback, discounted over time, and gated by market-level trust. We then introduce a decoupled counterfactual gate with three options: use the LLM prior while ignoring its reported confidence, use the prior together with its confidence, or abstain from the LLM prior entirely. Across controlled synthetic stress tests and three molecule optimization benchmarks with \qwenflash{}-generated expert priors, dynamic objective-wise calibration improves robustness over fixed LLM priors. Raw LLM confidence, however, is not reliably beneficial: on ESOL, confidence is positively correlated with prediction error; on FreeSolv, confidence can help; and on Lipophilicity, ignoring confidence remains the strongest option. Our fixed three-arm counterfactual gate improves over the first counterfactual variant on ESOL and FreeSolv, while an attempted margin portfolio exposes a useful negative result: margin selection should be acquisition-aware rather than based only on one-step prior error.
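To make the reputation-market idea concrete, the following is a minimal Python sketch of one possible per-objective update. The abstract does not give the actual update rule, so everything here is an assumption made for illustration: the class name `ObjectiveReputation`, the squared-error loss, the discounted exponential-weights update, and the trust gate defined as a discounted advantage of the weighted expert mixture over a baseline predictor. One such object would be kept per objective, so that an expert can gain weight on one objective while losing it on another.

```python
import math


class ObjectiveReputation:
    """Illustrative reputation tracker for ONE objective (hypothetical design)."""

    def __init__(self, n_experts, eta=1.0, discount=0.9, trust_threshold=0.0):
        self.log_w = [0.0] * n_experts        # one log-weight per expert
        self.eta = eta                        # learning rate on the loss (assumed)
        self.discount = discount              # forgetting factor in (0, 1] (assumed)
        self.trust_threshold = trust_threshold
        self.advantage = 0.0                  # discounted baseline loss minus market loss

    def weights(self):
        """Normalized expert weights (softmax of the log-weights)."""
        m = max(self.log_w)
        exp_w = [math.exp(v - m) for v in self.log_w]
        total = sum(exp_w)
        return [v / total for v in exp_w]

    def predict(self, expert_preds):
        """Weighted mixture of the experts' predictions for this objective."""
        return sum(w * p for w, p in zip(self.weights(), expert_preds))

    def trusted(self):
        """Market-level gate: use the prior only if it has beaten the baseline."""
        return self.advantage > self.trust_threshold

    def update(self, expert_preds, baseline_pred, observed):
        """Score experts and the market against one observed objective value."""
        market_loss = (self.predict(expert_preds) - observed) ** 2
        baseline_loss = (baseline_pred - observed) ** 2
        self.advantage = self.discount * self.advantage + (baseline_loss - market_loss)
        # Discount old evidence, then penalize each expert by its squared error.
        self.log_w = [
            self.discount * lw - self.eta * (p - observed) ** 2
            for lw, p in zip(self.log_w, expert_preds)
        ]


# Usage: expert 0 is accurate on this objective, expert 1 is off by one unit.
market = ObjectiveReputation(n_experts=2)
for _ in range(5):
    market.update(expert_preds=[2.0, 3.0], baseline_pred=1.0, observed=2.0)
print(market.weights(), market.trusted())
```

In this sketch the gate starts closed (no evidence yet) and opens only after the weighted mixture has accumulated a positive discounted advantage over the baseline. A three-arm gate of the kind described above could be layered on top by tracking a separate advantage for each arm, although the paper's actual construction may differ.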
