Evaluating the quality of LLM-generated reasoning traces in expert domains (e.g., law) is essential for ensuring credibility and explainability, yet remains challenging due to the inherent complexity of such reasoning tasks. We introduce LEGIT (LEGal Issue Trees), a novel large-scale (24K instances) expert-level legal reasoning dataset with an emphasis on reasoning trace evaluation. We convert court judgments into hierarchical trees of opposing parties' arguments and the court's conclusions, which serve as rubrics for evaluating the issue coverage and correctness of reasoning traces. We verify the reliability of these rubrics via human expert annotations and via comparison with coarser, less informative rubrics. Using LEGIT, we show that (1) LLMs' legal reasoning is seriously limited in both legal issue coverage and correctness, and that (2) retrieval-augmented generation (RAG) and reinforcement learning (RL) with rubrics bring complementary benefits: RAG improves overall reasoning capability, whereas RL improves correctness, albeit at the cost of reduced coverage.
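To make the rubric construction concrete, the minimal Python sketch below shows one way such an issue tree and the two metrics could be represented. It is an illustration under assumptions, not the authors' implementation: the names IssueNode, flatten, and score_trace are hypothetical, and exact-string matching of conclusions stands in for what would in practice require an LLM or expert judge.

from dataclasses import dataclass, field

@dataclass
class IssueNode:
    """One legal issue: both parties' arguments plus the court's conclusion."""
    issue: str                # short description of the legal issue
    plaintiff_argument: str   # one party's argument on this issue
    defendant_argument: str   # the opposing party's argument
    court_conclusion: str     # the court's holding on this issue
    children: list["IssueNode"] = field(default_factory=list)  # sub-issues

def flatten(node: IssueNode) -> list[IssueNode]:
    """Depth-first traversal collecting every issue in the tree."""
    nodes = [node]
    for child in node.children:
        nodes.extend(flatten(child))
    return nodes

def score_trace(root: IssueNode,
                addressed: dict[str, str]) -> tuple[float, float]:
    """Score a reasoning trace against the rubric tree.

    `addressed` maps issue descriptions the trace discusses to the
    conclusion the trace reaches on each. Coverage is the fraction of
    rubric issues the trace addresses; correctness is the fraction of
    addressed issues whose conclusion matches the court's.
    """
    issues = flatten(root)
    covered = [n for n in issues if n.issue in addressed]
    coverage = len(covered) / len(issues)
    correct = [n for n in covered
               if addressed[n.issue] == n.court_conclusion]
    correctness = len(correct) / len(covered) if covered else 0.0
    return coverage, correctness

# Hypothetical usage on a two-issue tree:
root = IssueNode(
    issue="breach of contract",
    plaintiff_argument="delivery was late",
    defendant_argument="delay was excused by force majeure",
    court_conclusion="breach established",
    children=[IssueNode("damages", "lost profits are recoverable",
                        "claimed damages are speculative",
                        "partial damages awarded")],
)
coverage, correctness = score_trace(
    root, {"breach of contract": "breach established"})
# coverage = 0.5 (1 of 2 issues addressed), correctness = 1.0

Under this framing, a trace can score high on coverage while reaching wrong conclusions, or vice versa, which is why the two metrics are reported separately.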