Legal text generated by large language models (LLMs) can usually achieve reasonable factual accuracy, but it frequently fails to adhere to the specialised stylistic norms and linguistic conventions of legal writing. Improving stylistic quality first requires a reliable evaluation method. However, having legal experts manually develop such a metric is impractical, because the implicit stylistic requirements of legal writing practice are difficult to formalise into explicit rubrics. Existing automatic evaluation methods also fall short: reference-based metrics conflate semantic accuracy with stylistic fidelity, and LLM-as-a-judge evaluations suffer from opacity and inconsistency. To address these challenges, we introduce CLASE (Chinese LegAlese Stylistic Evaluation), an evaluation method that focuses on the stylistic performance of legal text. Its hybrid scoring mechanism combines 1) linguistic feature-based scores and 2) experience-guided LLM-as-a-judge scores. Both the feature coefficients and the LLM scoring experiences are learned from contrastive pairs of authentic legal documents and their LLM-restored counterparts. This design captures both surface-level features and implicit stylistic norms in a transparent, reference-free manner. Experiments on 200 Chinese legal documents show that CLASE achieves substantially higher alignment with human judgments than traditional metrics and pure LLM-as-a-judge methods. Beyond improved alignment, CLASE provides interpretable score breakdowns and suggestions for improvement, offering a scalable and practical solution for professional stylistic evaluation in legal text generation (code and data for CLASE are available at: https://github.com/rexera/CLASE).
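The hybrid scoring idea described above can be illustrated with a minimal sketch. The abstract does not specify which linguistic features are used, how the coefficients are fitted, or how the two scores are combined, so everything below is an assumption made for illustration: the two toy features, the `MARKERS` list, the pairwise logistic fit in `learn_coefficients`, the weight `alpha`, and the `judge_score` argument (a stand-in for an LLM-as-a-judge score in [0, 1], which is not computed here).

```python
import math

# Hypothetical legalese markers, chosen only for illustration.
MARKERS = ("本院", "予以", "依照", "之规定")


def extract_features(text):
    """Two toy surface features: mean sentence length and marker density."""
    sentences = [s for s in text.split("。") if s]
    mean_len = sum(len(s) for s in sentences) / max(len(sentences), 1)
    marker_density = sum(text.count(m) for m in MARKERS) / max(len(text), 1)
    # Crude rescaling so both features have a comparable magnitude.
    return [mean_len / 100.0, marker_density * 10.0]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def learn_coefficients(pairs, epochs=500, lr=0.5):
    """Fit w so that sigmoid(w . (f(authentic) - f(restored))) approaches 1.

    `pairs` is a list of (authentic_text, llm_restored_text) tuples.
    This pairwise logistic fit is an assumed procedure, not the paper's.
    """
    dim = len(extract_features(pairs[0][0]))
    w = [0.0] * dim
    for _ in range(epochs):
        for authentic, restored in pairs:
            fa = extract_features(authentic)
            fr = extract_features(restored)
            diff = [a - r for a, r in zip(fa, fr)]
            p = sigmoid(sum(wi * di for wi, di in zip(w, diff)))
            # Gradient ascent on log p, the pairwise logistic objective.
            w = [wi + lr * (1.0 - p) * di for wi, di in zip(w, diff)]
    return w


def feature_score(text, w):
    """Map the weighted features of a single text to a score in (0, 1)."""
    feats = extract_features(text)
    return sigmoid(sum(wi * fi for wi, fi in zip(w, feats)))


def hybrid_score(text, w, judge_score, alpha=0.5):
    """Weighted mix of the feature score and an externally supplied judge score."""
    return alpha * feature_score(text, w) + (1.0 - alpha) * judge_score
```

Because the learned coefficients stay visible in `w`, a score computed this way can be broken down feature by feature, which is the kind of transparency the hybrid design aims for. Scoring a new text needs no reference document, only the text itself and a judge score.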