We present a readiness harness for LLM and RAG applications that turns evaluation into a deployment decision workflow. The system combines automated benchmarks, OpenTelemetry observability, and CI quality gates under a minimal API contract, then aggregates workflow success, policy compliance, groundedness, retrieval hit rate, cost, and p95 latency into scenario-weighted readiness scores with Pareto frontiers. We evaluate the harness on ticket-routing workflows and BEIR grounding tasks (SciFact and FiQA) with full Azure matrix coverage (162/162 valid cells across datasets, scenarios, retrieval depths, seeds, and models). Results show that readiness is not a single metric: on FiQA under sla-first at k=5, gpt-4.1-mini leads in readiness and faithfulness, while gpt-5.2 pays a substantial latency cost; on SciFact, models are closer in quality but still separable operationally. Ticket-routing regression gates consistently reject unsafe prompt variants, demonstrating that the harness can block risky releases instead of merely reporting offline scores. The result is a reproducible, operationally grounded framework for deciding whether an LLM or RAG system is ready to ship.
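The aggregation step described in the abstract can be illustrated with a minimal sketch. It is not the harness's actual implementation: the weight values, the budget-based normalization of cost and latency, the choice of Pareto objectives, and all names (`RunMetrics`, `SCENARIO_WEIGHTS`, `readiness`, `pareto_frontier`) are assumptions made for illustration. Only the metric list and the `sla-first` scenario name come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunMetrics:
    """Per-configuration metrics; quality fields are assumed to lie in [0, 1]."""
    name: str
    success: float        # workflow success rate
    compliance: float     # policy compliance rate
    groundedness: float   # groundedness / faithfulness score
    hit_rate: float       # retrieval hit rate
    cost_usd: float       # cost per request (lower is better)
    p95_latency_s: float  # p95 latency in seconds (lower is better)


# Hypothetical weights: the scenario name is from the abstract, the values are not.
SCENARIO_WEIGHTS = {
    "sla-first": {
        "success": 0.20, "compliance": 0.15, "groundedness": 0.15,
        "hit_rate": 0.10, "cost": 0.10, "latency": 0.30,
    },
}


def readiness(m: RunMetrics, weights: dict[str, float],
              cost_budget_usd: float, latency_budget_s: float) -> float:
    """Weighted mean of per-metric scores, each mapped to [0, 1].

    Cost and latency become scores via an assumed linear budget rule:
    0 at or above the budget, 1 at zero cost or latency.
    """
    parts = {
        "success": m.success,
        "compliance": m.compliance,
        "groundedness": m.groundedness,
        "hit_rate": m.hit_rate,
        "cost": max(0.0, 1.0 - m.cost_usd / cost_budget_usd),
        "latency": max(0.0, 1.0 - m.p95_latency_s / latency_budget_s),
    }
    total = sum(weights.values())
    return sum(weights[k] * parts[k] for k in weights) / total


def pareto_frontier(runs: list[RunMetrics]) -> list[RunMetrics]:
    """Keep runs that no other run dominates.

    Objectives (an assumption): maximize groundedness, minimize cost
    and p95 latency.
    """
    def dominates(a: RunMetrics, b: RunMetrics) -> bool:
        no_worse = (a.groundedness >= b.groundedness
                    and a.cost_usd <= b.cost_usd
                    and a.p95_latency_s <= b.p95_latency_s)
        strictly_better = (a.groundedness > b.groundedness
                           or a.cost_usd < b.cost_usd
                           or a.p95_latency_s < b.p95_latency_s)
        return no_worse and strictly_better

    return [r for r in runs if not any(dominates(o, r) for o in runs)]
```

Under this sketch, a latency-heavy weighting such as `sla-first` lowers the score of a slower configuration even when its groundedness is higher, while the Pareto frontier still retains that configuration as a valid trade-off. This mirrors the abstract's point that readiness is not a single metric.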