We study offline black-box optimization (BBO), which aims to discover improved designs from an offline dataset of designs and their labels. The problem is common in robotics, DNA design, and materials science, where labeled samples are limited. Recent work applies autoregressive LLMs to BBO by formatting tasks as natural-language prompts, but left-to-right generation struggles to capture the strong bidirectional dependencies inherent in design problems. To address this, we propose adapting diffusion LLMs to offline BBO to leverage their bidirectional modeling capabilities. However, a domain gap exists between the natural-text pre-training of diffusion LLMs and the heterogeneous signals in BBO (prompts, designs, and labels). To bridge this gap, we construct a unified prompt-response corpus and introduce delimiter tokens that explicitly mark field boundaries for domain adaptation. We further propose a two-stage post-training framework that aligns diffusion LLM generation with high-label designs. The first stage performs supervised fine-tuning on the unified dataset via masked-response prediction, and the second stage applies reinforcement learning with rewards defined by label improvements. Our method achieves state-of-the-art results on Design-Bench under small-data settings. Code is available at https://github.com/zpointS/DiBO.
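The data formatting and masking steps described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper or its repository: the delimiter strings (`<design>`, `<label>`), the `<mask>` token, the choice to place a target label in the prompt, and the reward being a plain difference against a reference label. The sketch only shows the general idea of wrapping fields in delimiter tokens, masking response positions while leaving the prompt visible, and scoring a generated design by its label improvement.

```python
import random

# Hypothetical delimiter and mask tokens (assumed; the abstract does not name them).
DESIGN_OPEN, DESIGN_CLOSE = "<design>", "</design>"
LABEL_OPEN, LABEL_CLOSE = "<label>", "</label>"
MASK = "<mask>"


def format_example(task, design, label):
    """Build (prompt_tokens, response_tokens) with explicit field delimiters.

    Assumed layout: the prompt holds the task text and a label field,
    and the response holds the design wrapped in design delimiters.
    """
    prompt = ["Task:", task, LABEL_OPEN, f"{label:.3f}", LABEL_CLOSE]
    response = [DESIGN_OPEN, *[str(x) for x in design], DESIGN_CLOSE]
    return prompt, response


def mask_response(prompt, response, mask_ratio, rng):
    """Mask response positions independently with probability mask_ratio.

    The prompt is never masked. Returns the noisy token sequence and the
    indices of masked positions, which are the prediction targets.
    """
    noisy = list(prompt) + list(response)
    offset = len(prompt)
    targets = []
    for i in range(len(response)):
        if rng.random() < mask_ratio:
            noisy[offset + i] = MASK
            targets.append(offset + i)
    return noisy, targets


def label_improvement_reward(new_label, reference_label):
    """Assumed reward: how much the new design's label exceeds a reference label."""
    return new_label - reference_label


if __name__ == "__main__":
    rng = random.Random(0)
    prompt, response = format_example("maximize score", [1, 0, 3], 0.5)
    noisy, targets = mask_response(prompt, response, 0.5, rng)
    print(noisy)
    print(targets)
```

In an actual fine-tuning loop, the loss would be computed only at the returned target positions, so the model learns to reconstruct designs conditioned on the full, unmasked prompt.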