Who judges the judges? Governance from metrics: a runtime framework for continuous LLM compliance monitoring

Current approaches to AI compliance treat conformity as a binary, audit-time verdict rather than a continuous, measurable property of production systems. We argue that this compliance fiction is structurally ill-suited to the requirements of the EU AI Act, which demands ongoing human oversight and the detection of emergent behavioural drift in deployed systems. We introduce governance from metrics, a principle whereby regulatory compliance is derived as a continuous signal from runtime observability rather than from static assessments. Building on this principle, we present govllm, an open-source framework implementing a governance-driven routing architecture in which model selection is determined by accumulated compliance scores rather than by latency or cost alone. Central to our approach is a panel of regulatory judges, LLM evaluators specialised per criterion (EU AI Act, GDPR, ANSSI, accessibility), whose inter-judge disagreement we reframe not as noise but as a regulatory uncertainty signal warranting human arbitration. We validate this approach on a ground-truth corpus of 49 annotated prompt/response pairs across five regulatory criteria, evaluated by four small language models (SLMs, 1.7B to 7B parameters) running fully on-premise. Agreement rates with the annotations range from 51.5% (mistral:7b) to 69.1% (phi4-mini), and no single model dominates across all criteria, which empirically motivates the Profile-as-jury design. We further document three structural failure modes in small regulatory judges, as well as a judge-specific position bias that degrades agreement by up to 25 percentage points across three question-order conditions (original, reversed, permuted). govllm is released as open-source software to support reproducible AI governance research.
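The ideas in the abstract (a panel of judges, disagreement treated as a signal for human arbitration, and routing by accumulated compliance score) can be illustrated with a minimal Python sketch. This is not govllm's actual API: every name here (`aggregate_panel`, `agreement_rate`, `ComplianceRouter`), the verdict labels, the 0.25 disagreement threshold, and the exponential moving average used to accumulate scores are assumptions chosen purely for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class PanelResult:
    verdict: str         # majority verdict across the panel
    disagreement: float  # share of judges that did not vote with the majority
    needs_human: bool    # True when disagreement exceeds the threshold


def aggregate_panel(verdicts: dict[str, str], threshold: float = 0.25) -> PanelResult:
    """Combine per-judge verdicts; high disagreement is flagged, not averaged away.

    `verdicts` maps a hypothetical judge name to its verdict label.
    The threshold value is an illustrative assumption.
    """
    if not verdicts:
        raise ValueError("at least one judge verdict is required")
    counts = Counter(verdicts.values())
    majority, majority_votes = counts.most_common(1)[0]
    disagreement = 1.0 - majority_votes / len(verdicts)
    return PanelResult(majority, disagreement, disagreement > threshold)


def agreement_rate(predicted: list[str], gold: list[str]) -> float:
    """Fraction of judge verdicts that match the human annotations."""
    if not gold or len(predicted) != len(gold):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


@dataclass
class ComplianceRouter:
    """Route to the candidate model with the best accumulated compliance score."""
    alpha: float = 0.2  # smoothing factor for the moving average (assumed)
    scores: dict[str, float] = field(default_factory=dict)

    def record(self, model: str, result: PanelResult) -> None:
        # Uncertain cases are held for human arbitration and do not move the score.
        if result.needs_human:
            return
        observed = 1.0 if result.verdict == "compliant" else 0.0
        previous = self.scores.get(model, observed)
        self.scores[model] = (1 - self.alpha) * previous + self.alpha * observed

    def select(self, candidates: list[str]) -> str:
        # Models with no recorded score default to 0.0 in this sketch.
        return max(candidates, key=lambda m: self.scores.get(m, 0.0))


if __name__ == "__main__":
    panel = aggregate_panel({
        "eu_ai_act": "compliant",
        "gdpr": "compliant",
        "anssi": "non_compliant",
        "accessibility": "compliant",
    })
    router = ComplianceRouter()
    router.record("model_a", panel)
    print(panel, router.select(["model_a", "model_b"]))
```

In this sketch a split panel never updates the routing score; it only raises `needs_human`, which mirrors the abstract's framing of disagreement as regulatory uncertainty rather than noise. A real deployment would also need to decide how arbitrated verdicts feed back into the accumulated score.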
