A semi-structured explanation depicts the implicit process of a reasoner with an explicit representation. Such an explanation highlights how the information available in a specific query is supplemented with information the reasoner produces from its internal weights to generate an answer. Despite recent improvements in the generative capabilities of language models, producing structured explanations that verify a model's true reasoning capabilities remains a challenge. This issue is particularly pronounced for not-so-large LMs, as the reasoner is expected to couple a sequential answer with a structured explanation that embodies both the correct presentation and the correct reasoning process. In this work, we first underscore the limitations of supervised fine-tuning (SFT) in tackling this challenge, and then introduce a carefully crafted reward engineering method in reinforcement learning (RL) to better address this problem. We investigate multiple reward aggregation methods and provide a detailed discussion that sheds light on the promising potential of RL for future research. Our proposed reward method achieves new state-of-the-art results on two semi-structured explanation generation benchmarks (ExplaGraph and COPA-SSE).