Semi-structured explanations depict the implicit reasoning process of a reasoner with an explicit representation. They highlight how the information available in a specific query is utilised and supplemented with information a reasoner produces from its internal weights in order to generate an answer. Despite recent improvements in the generative capabilities of language models, producing structured explanations to verify a model's true reasoning capabilities remains a challenge. This issue is particularly pronounced for not-so-large LMs (e.g., FLAN-T5-XXL). In this work, we first underscore the limitations of supervised fine-tuning (SFT) in tackling this challenge, and then introduce a carefully crafted reward engineering method in reinforcement learning (RL) to better address it. We investigate multiple reward aggregation methods and provide a detailed discussion that sheds light on the promising potential of RL for future research. Our proposed method achieves new state-of-the-art results on two semi-structured explanation generation benchmarks (ExplaGraph and COPA-SSE).
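To make the notion of reward aggregation concrete, the sketch below combines per-aspect rewards into a single scalar for the RL update. This is a minimal illustration under assumed component rewards (a structural-validity score and a faithfulness score) and assumed weights; it is not the paper's actual reward design, which the abstract does not specify.

```python
# Minimal sketch of reward aggregation for RL-based explanation generation.
# The component rewards, weights, and function names here are illustrative
# assumptions, not the method described in the paper.

def aggregate_rewards(rewards: dict[str, float],
                      weights: dict[str, float],
                      method: str = "weighted_sum") -> float:
    """Combine per-aspect rewards (e.g., structural validity of the
    generated graph, semantic faithfulness to the query) into one scalar."""
    if method == "weighted_sum":
        # Additive aggregation: aspects trade off against each other.
        return sum(weights[k] * rewards[k] for k in rewards)
    if method == "product":
        # Multiplicative aggregation: any near-zero component
        # drags the overall reward toward zero.
        total = 1.0
        for k in rewards:
            total *= max(rewards[k], 1e-8)
        return total
    raise ValueError(f"unknown aggregation method: {method}")

# Example: a generated explanation graph that is well-formed
# but only weakly grounded in the query.
rewards = {"structure": 0.9, "faithfulness": 0.4}
weights = {"structure": 0.5, "faithfulness": 0.5}
print(aggregate_rewards(rewards, weights))             # 0.65
print(aggregate_rewards(rewards, weights, "product"))  # 0.36
```

The two aggregation schemes behave differently at the extremes: a weighted sum lets a strong structural score compensate for weak faithfulness, whereas a product penalises any single failing aspect, which is one reason comparing aggregation methods matters.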