We study reasoning for accessing world knowledge stored in a language model's parameters. For example, recalling that Canberra is Australia's capital may benefit from thinking through major cities and the concept of purpose-built capitals. While reasoning language models are trained via reinforcement learning to produce reasoning traces on tasks such as mathematics, they may not reason well when accessing their own world knowledge. We first find that models do not generate their best world-knowledge reasoning by default: adding a simple "think step-by-step" cue yields a statistically significant improvement in knowledge recall but not in math. Motivated by this, we propose training models to reason over their parametric knowledge, using world-knowledge question answering as a verifiable reward. After reinforcement learning on TriviaQA (+9.9%), performance also improves on Natural Questions, HotpotQA, SimpleQA, and StrategyQA by 4.2%, 2.1%, 0.6%, and 3.0%, respectively. Reasoning models are under-optimized for parametric knowledge access, but can easily be trained to reason better.
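The "verifiable reward" above can be illustrated with a minimal sketch: question answering is verifiable because the model's final answer can be checked against gold answer strings. The function names (`normalize`, `verifiable_reward`) and the exact normalization scheme here are hypothetical illustrations, not the paper's implementation; a common QA-evaluation convention (lowercasing, dropping punctuation and articles) is assumed.

```python
import re
import string

def normalize(text: str) -> str:
    """Normalize an answer string: lowercase, strip punctuation,
    drop English articles, and collapse whitespace (a common QA
    evaluation convention; assumed here, not taken from the paper)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def verifiable_reward(model_answer: str, gold_answers: list[str]) -> float:
    """Binary reward for RL: 1.0 if the model's final answer exactly
    matches any gold alias after normalization, else 0.0."""
    pred = normalize(model_answer)
    return 1.0 if any(pred == normalize(g) for g in gold_answers) else 0.0
```

For example, `verifiable_reward("the Canberra", ["Canberra"])` returns 1.0 after article stripping, while `verifiable_reward("Sydney", ["Canberra"])` returns 0.0; such a binary signal is what makes the task usable as a reward in reinforcement learning.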