Case-based reasoning is a cornerstone of U.S. legal practice, requiring professionals to argue about a current case by drawing analogies to, and distinguishing it from, past precedents. While Large Language Models (LLMs) have shown remarkable capabilities, their proficiency in this complex, nuanced form of reasoning needs further investigation. We propose a formal framework that decomposes the process of identifying significant distinctions between cases into three staged reasoning tasks. Our framework models cases using factual predicates called factors, organizes them into a legal knowledge hierarchy, and defines verifiable rules for identifying distinctions, analyzing their argumentative support, and evaluating their significance. Through a comprehensive evaluation of modern reasoning LLMs, we reveal a paradox: while models achieve high accuracy on surface-level reasoning (Task 1), performance degrades on hierarchical reasoning (Task 2: 64.82%-92.09%) and collapses on integrated analysis (Task 3: 11.46%-33.99%). Most strikingly, we find that models consistently expend more computational resources on incorrect responses than on correct ones, suggesting that "thinking longer" does not always mean "thinking smarter." Our work provides a methodology for fine-grained analysis of LLM reasoning capabilities in complex domains and reveals fundamental limitations that must be addressed for robust and trustworthy legal AI.