Evaluating large language models (LLMs) for legal reasoning requires workflows that span task design, expert annotation, model execution, and metric-based evaluation. In practice, these steps are split across platforms and scripts, limiting transparency, reproducibility, and participation by non-technical legal experts. We present the BenGER (Benchmark for German Law) framework, an open-source web platform that integrates task creation, collaborative annotation, configurable LLM runs, and evaluation with lexical, semantic, factual, and judge-based metrics. BenGER supports multi-organization projects with tenant isolation and role-based access control, and can optionally provide formative, reference-grounded feedback to annotators. We will demonstrate a live deployment showing end-to-end benchmark creation and analysis.