Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a promising approach to improving correctness in LLMs. In many scientific problems, however, the objective is not necessarily to produce a single correct answer, but to produce a diverse array of candidates that satisfy a set of constraints. We study this challenge in the context of materials generation. To this end, we introduce PLaID++, an LLM post-trained for stable and property-guided crystal generation. We find that performance hinges on our crystallographic representation and reward formulation. First, we introduce a compact, symmetry-informed Wyckoff text representation that improves computational efficiency and encourages generalization from physical priors. Second, we demonstrate that temperature scaling acts as an entropy regularizer that counteracts mode collapse and encourages exploration. By encoding symmetry constraints directly into text and guiding model outputs towards desirable chemical space, PLaID++ generates structures that are thermodynamically stable, unique, and novel at a roughly 50% higher rate than prior methods, and it conditionally generates structures with a desired space group. Our work demonstrates the potential of adapting post-training techniques from natural language processing to materials design, paving the way for targeted and efficient discovery of novel materials.
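The claim that temperature scaling acts as an entropy regularizer can be illustrated with a small, generic sketch. This is not the PLaID++ training or sampling code, which the abstract does not describe. It only shows the standard effect of dividing logits by a temperature before the softmax: higher temperatures flatten the distribution and raise its entropy. The function names and the logit values below are hypothetical and chosen purely for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits (illustrative helper)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Made-up logits for four candidate tokens (an assumption, not data from the paper).
logits = [4.0, 2.0, 1.0, 0.5]

# Entropy of the sampling distribution at several temperatures.
entropies = {t: entropy(softmax(logits, t)) for t in (0.5, 1.0, 2.0)}

for t, h in entropies.items():
    print(f"temperature={t}: entropy={h:.3f} nats")
```

For any set of logits that are not all equal, the entropy of the scaled distribution increases monotonically with temperature and is bounded above by the log of the number of outcomes. This is the sense in which a higher sampling temperature spreads probability mass and works against collapse onto a few modes.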