We study offline black-box optimization (BBO), which aims to discover improved designs from an offline dataset of designs and their labels. The problem is common in robotics, DNA design, and materials science, where labeled samples are limited. Recent work applies autoregressive LLMs to BBO by formatting tasks as natural-language prompts, but their left-to-right generation struggles to capture the strong bidirectional dependencies inherent in design problems. To address this, we propose adapting diffusion LLMs to offline BBO to leverage their bidirectional modeling capabilities. However, there is a domain gap between the natural-text pre-training of diffusion LLMs and the heterogeneous signals in BBO (prompts, designs, and labels). To bridge this gap, we construct a unified prompt-response corpus and introduce delimiter tokens that explicitly mark field boundaries for domain adaptation. We further propose a two-stage post-training framework that aligns the diffusion LLM's generation with high-label designs. The first stage performs supervised fine-tuning on the unified dataset via masked-response prediction, and the second stage applies reinforcement learning with rewards defined by label improvements. Our method achieves state-of-the-art results in the small-data settings of Design-Bench.
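To make the data formatting and the two training signals concrete, the following is a minimal sketch under stated assumptions, not the paper's implementation. The delimiter and mask token strings, the prompt layout, the function names, and the choice of baseline in the reward are all hypothetical; the abstract specifies only that delimiter tokens mark field boundaries, that the first stage masks the response, and that rewards are defined by label improvements.

```python
import random

# Hypothetical token strings; the source does not name its delimiter or mask tokens.
DESIGN_OPEN, DESIGN_CLOSE = "<design>", "</design>"
LABEL_OPEN, LABEL_CLOSE = "<label>", "</label>"
MASK = "<mask>"


def build_example(task, design, label):
    """Format one (design, label) pair as prompt and response token lists.

    Assumed layout: the prompt carries the task text and a delimited label,
    and the response carries the delimited design.
    """
    prompt = task.split() + [LABEL_OPEN, f"{label:.2f}", LABEL_CLOSE]
    response = [DESIGN_OPEN] + [str(x) for x in design] + [DESIGN_CLOSE]
    return prompt, response


def mask_response(prompt, response, mask_ratio, rng):
    """Mask a random subset of response tokens; prompt tokens stay visible.

    Returns the corrupted sequence and the positions a masked-response
    prediction loss would be computed on.
    """
    n_mask = max(1, round(mask_ratio * len(response)))
    masked_idx = set(rng.sample(range(len(response)), n_mask))
    noisy = [MASK if i in masked_idx else tok for i, tok in enumerate(response)]
    # Positions are offsets into the concatenated prompt + response sequence.
    loss_positions = sorted(len(prompt) + i for i in masked_idx)
    return prompt + noisy, loss_positions


def label_improvement_reward(new_label, best_offline_label):
    """One assumed reward: improvement over the best label in the offline data."""
    return new_label - best_offline_label
```

Because masking touches response positions only, the model always sees the full prompt and can condition each masked design token on context from both sides, which is the bidirectional property the abstract attributes to diffusion LLMs.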