Large language models now produce legal text of at least median quality, yet no existing benchmark can evaluate whether they perform doctrinal legal reasoning. Such reasoning forms the interpretive core of legal work, whereas most current legal-AI evaluations measure ancillary, paralegal tasks. This measurement gap is legal as well as methodological: the EU AI Act makes "appropriate accuracy" a binding requirement for high-risk AI used in the judicial domain, but that requirement cannot acquire operational content without the doctrinal-reasoning benchmark the field lacks.