Large language model (LLM) applications are increasingly expected to satisfy deterministic institutional requirements while relying on probabilistic generative components. This mismatch makes ordinary post-hoc benchmarking insufficient for systems that must be safe, reliable, auditable, and economically useful. This paper contributes an evaluation-protocol extension for operational LLM systems grounded in acceptance-test-driven development, safety engineering, and business-centric validation. The extension translates stakeholder goals into executable behavioral contracts, release gates, monitoring signals, and evidence artifacts before prompt, model, retrieval, or agent changes are accepted. It adapts the red-green-refactor discipline of test-driven development to a red-train-green lifecycle: first define failing acceptance tests for desired behavior, then improve the LLM system through prompt changes, retrieval design, fine-tuning, guardrails, or data augmentation, and finally release only when multidimensional gates are satisfied. The contribution is a governance-oriented metric stack, reference architecture, and empirical protocol for comparing acceptance-test-driven LLM development against prompt-first and benchmark-after workflows.
翻译:大语言模型应用在依赖概率生成组件的同时,日益需要满足确定性机构要求。这种不匹配使得常规的事后测评对于必须安全、可靠、可审计且具有经济价值的系统而言难以胜任。本文提出一种面向运营中大语言模型系统的评估协议扩展,其基础是验收测试驱动开发、安全工程和以商业为中心的验证。该扩展在prompt、模型、检索或智能体变更被接受前,将利益相关者目标转化为可执行的行为契约、发布门禁、监控信号和证据工件。它将测试驱动开发中的红-绿-重构规范调整为红-训-绿生命周期:首先为期望行为定义失败的验收测试,随后通过prompt调整、检索设计、微调、护栏或数据增强改进大语言模型系统,最终仅在满足多维门禁条件时发布。本文的核心贡献在于提出面向治理的度量堆栈、参考架构和实证协议,用以比较验收测试驱动的大语言模型开发与prompt优先和事后基准测试工作流。