Legal$Δ$: Enhancing Legal Reasoning in LLMs via Reinforcement Learning with Chain-of-Thought Guided Information Gain

Legal Artificial Intelligence (LegalAI) has made notable advances in automating judicial decision-making with the support of Large Language Models (LLMs). However, existing legal LLMs still struggle to generate reliable and interpretable reasoning processes. They often default to fast-thinking behavior, producing direct answers without explicit multi-step reasoning, which limits their effectiveness in complex legal scenarios that demand rigorous justification. To address this challenge, we propose Legal$Δ$, a reinforcement learning framework designed to enhance legal reasoning through chain-of-thought guided information gain. During training, Legal$Δ$ employs a dual-mode input setup, comprising a direct-answer mode and a reasoning-augmented mode, and maximizes the information gain between the two. This encourages the model to acquire meaningful reasoning patterns rather than generate superficial or redundant explanations. Legal$Δ$ follows a two-stage approach: (1) distilling latent reasoning capabilities from a powerful Large Reasoning Model (LRM), DeepSeek-R1, and (2) refining reasoning quality via differential comparisons, combined with a multidimensional reward mechanism that assesses both structural coherence and legal-domain specificity. Experimental results on multiple legal reasoning tasks demonstrate that Legal$Δ$ outperforms strong baselines in both accuracy and interpretability. It consistently produces more robust and trustworthy legal judgments without relying on labeled preference data. All code and data will be released at https://github.com/NEUIR/LegalDelta.
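To make the dual-mode idea concrete, the following is a minimal sketch and not the released implementation. It assumes that information gain can be approximated as the difference between the log-likelihood a model assigns to the reference answer in the reasoning-augmented mode and in the direct-answer mode. The function names, the per-token probability inputs, and the weighted-sum combination with structure and domain scores are all hypothetical choices made for illustration; the abstract does not specify them.

```python
import math
from typing import Sequence


def sequence_log_prob(token_probs: Sequence[float]) -> float:
    """Sum the log-probabilities of the answer tokens.

    `token_probs` holds the probability the model assigned to each token of
    the reference answer; every value must lie in (0, 1].
    """
    for p in token_probs:
        if not 0.0 < p <= 1.0:
            raise ValueError(f"token probability out of range: {p}")
    return sum(math.log(p) for p in token_probs)


def information_gain(
    direct_probs: Sequence[float],
    reasoning_probs: Sequence[float],
) -> float:
    """Assumed proxy for information gain between the two input modes.

    direct_probs:    answer-token probabilities given the query only.
    reasoning_probs: answer-token probabilities given the query plus a
                     chain-of-thought.
    A positive value means the reasoning made the answer more likely.
    """
    return sequence_log_prob(reasoning_probs) - sequence_log_prob(direct_probs)


def combined_reward(
    gain: float,
    structure_score: float,
    domain_score: float,
    weights: Sequence[float] = (1.0, 0.5, 0.5),  # hypothetical weights
) -> float:
    """Hypothetical weighted sum of the gain and two quality scores."""
    w_gain, w_structure, w_domain = weights
    return w_gain * gain + w_structure * structure_score + w_domain * domain_score


if __name__ == "__main__":
    # Toy numbers: reasoning raises the probability of each answer token.
    gain = information_gain([0.5, 0.5], [0.8, 0.9])
    print(f"information gain: {gain:.4f}")
    print(f"combined reward:  {combined_reward(gain, 1.0, 0.5):.4f}")
```

Under this assumed formulation, a chain-of-thought that leaves the answer likelihood unchanged yields zero gain, which is one way a reward could discourage superficial or redundant explanations.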
