Large language models (LLMs) are often deployed in environments where facts evolve, yet updating their factual knowledge by fine-tuning on unstructured text often suffers from two problems: (1) reliance on compute-heavy paraphrase augmentation and (2) the reversal curse. Recent studies show that diffusion large language models (dLLMs) require fewer training samples to reach lower pre-training loss and are more resistant to the reversal curse, suggesting that dLLMs may learn new knowledge more easily than autoregressive LLMs (arLLMs). We test this hypothesis in controlled knowledge fine-tuning experiments and find that, while arLLMs rely on paraphrase augmentation to generalize knowledge text into question-answering (QA) capability, dLLMs achieve high QA accuracy without paraphrases. To investigate whether the demasking objective alone, independent of the diffusion denoising paradigm, can produce this knowledge injection advantage, we propose masked fine-tuning for arLLMs, which prompts an arLLM to reconstruct the original text given a masked version in context. Masked fine-tuning substantially improves the efficacy of knowledge injection in arLLMs: it requires no paraphrases and is resistant to the reversal curse, closing the gap between arLLMs and dLLMs. We also demonstrate broader applicability: on a large-scale knowledge-intensive dataset (1.2M samples), masked supervised fine-tuning (SFT) achieves the best downstream accuracy on GPQA-diamond among all fine-tuning variants. The demasking objective also improves SFT on math tasks, suggesting broad utility beyond factual knowledge injection.
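As a rough illustration of the masked fine-tuning idea for arLLMs (reconstructing the original text from a masked copy placed in context), the sketch below builds one training example. It is only a minimal sketch under stated assumptions: the `[MASK]` placeholder, the word-level masking, the 50% mask ratio, the prompt template, and the function name `build_masked_example` are all hypothetical and are not taken from the paper.

```python
import random

MASK = "[MASK]"  # assumed placeholder string; the abstract does not name one


def build_masked_example(text, mask_ratio=0.5, seed=None):
    """Build a (prompt, target) pair for masked fine-tuning of an arLLM.

    The prompt carries a masked copy of the text; the target is the
    original text that the model is asked to reconstruct.
    Word-level masking and the prompt wording are illustrative assumptions.
    """
    rng = random.Random(seed)
    words = text.split()
    # Number of words to hide: at least one, never more than all of them.
    n_mask = min(len(words), max(1, round(len(words) * mask_ratio))) if words else 0
    masked_idx = set(rng.sample(range(len(words)), n_mask))
    masked = " ".join(MASK if i in masked_idx else w for i, w in enumerate(words))
    prompt = f"Masked text: {masked}\nOriginal text: "
    return {
        "prompt": prompt,                      # context shown to the model
        "target": text,                        # sequence the model must generate
        "masked_indices": sorted(masked_idx),  # kept for inspection only
    }


if __name__ == "__main__":
    example = build_masked_example(
        "Marie Curie was born in Warsaw in 1867", mask_ratio=0.5, seed=0
    )
    print(example["prompt"] + example["target"])
```

In a typical supervised fine-tuning setup, the next-token loss would presumably be computed on the `target` tokens only, with the prompt tokens excluded; whether the paper does exactly this is not stated in the abstract.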