Generating rationales that justify scoring decisions has been a promising way to facilitate explainability in automated scoring systems. However, existing methods do not match the accuracy of classifier-based methods, and the generated rationales often contain hallucinated information. To address these issues, we propose a novel framework capable of generating more faithful rationales and, more importantly, matching the performance of classifier-based black-box scoring systems. We first mimic the human assessment process by querying Large Language Models (LLMs) to generate a thought tree. We then summarise intermediate assessment decisions from each thought tree path to create synthetic rationale data and rationale preference data. Finally, we utilise the generated synthetic data to calibrate LLMs through a two-step training process: supervised fine-tuning and preference optimization. Extensive experimental results demonstrate that our framework improves assessment performance by 38% in QWK score compared to prior work, while producing higher-quality rationales, as recognised by human evaluators and LLMs. Our work sheds light on the effectiveness of performing preference optimization using synthetic preference data obtained from thought tree paths.
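The two-step calibration described above can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: it assumes preference pairs are formed by contrasting thought-tree paths whose summarised score matches the gold score ("chosen") against mismatching paths ("rejected"), and shows a DPO-style preference loss on one pair. All function and field names are illustrative.

```python
import math

def build_preference_pairs(paths, gold_score):
    """Pair rationales from thought-tree paths: paths reaching the gold
    score become 'chosen', paths reaching a wrong score become 'rejected'.
    `paths` is assumed to be a list of {"score": int, "rationale": str}."""
    chosen = [p["rationale"] for p in paths if p["score"] == gold_score]
    rejected = [p["rationale"] for p in paths if p["score"] != gold_score]
    return [(c, r) for c in chosen for r in rejected]

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """One-pair Direct Preference Optimization loss:
    -log sigmoid(beta * ((logp_c - ref_c) - (logp_r - ref_r))).
    Log-probabilities come from the policy being tuned; `ref_*` from the
    frozen supervised-fine-tuned reference model."""
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

A higher policy margin on the chosen rationale relative to the reference model drives the loss toward zero, which is what pushes the calibrated LLM toward the preferred thought-tree rationales.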