When language models are trained by reinforcement learning (RL) to write probabilistic programs, they can artificially inflate their marginal-likelihood reward by producing programs whose data distribution fails to normalise instead of fitting the data better. We call this failure likelihood hacking (LH). We formalise LH in a core probabilistic programming language (PPL) and give sufficient syntactic conditions for its prevention, proving that a safe language fragment $\mathcal{L}_{\text{safe}}$ satisfying these conditions cannot produce likelihood-hacking programs. Empirically, we show that GRPO-trained models generating PyMC code discover LH exploits within the first few training steps, driving violation rates well above the untrained-model baseline. We implement $\mathcal{L}_{\text{safe}}$'s conditions as $\texttt{SafeStan}$, a LH-resistant modification of Stan, and show empirically that it prevents LH under optimisation pressure. These results show that language-level safety constraints are both theoretically grounded and effective in practice for automated Bayesian model discovery.
翻译:当语言模型通过强化学习(RL)训练以编写概率程序时,它们可以人为地提高其边际似然奖励,通过生成数据分布未能归一化而非更好拟合数据的程序。我们将这种失败称为似然攻击(LH)。我们在一个核心概率编程语言(PPL)中形式化定义了LH,并给出了防止其发生的充分语法条件,证明满足这些条件的安全语言片段$\mathcal{L}_{\text{safe}}$无法产生似然攻击程序。在实验上,我们表明经过GRPO训练的、生成PyMC代码的模型会在最初的几个训练步骤中发现LH漏洞,使得违规率远远高于未训练模型的基线。我们将$\mathcal{L}_{\text{safe}}$的条件实现为$\texttt{SafeStan}$(一种抗LH的Stan修改版),并通过实验表明,在优化压力下它能有效防止LH。这些结果表明,语言层面的安全约束既有理论依据,又在自动化贝叶斯模型发现实践中有效。