High-stakes production document-generation systems require language models that are adaptive, evidence-grounded, and auditable. We present HOPM, a hierarchical online prompt mutation framework, and evaluate it on a real marketplace dispute-evidence workflow. HOPM treats prompts as online policies: a family/version router selects a prompt, deterministic guardrails attribute failures to mutable prompt-token categories, and dual feedback from human review and an automated judge updates both routing and mutation priorities. The primary evidence is a matched ablation observed in production evaluation: each of seven variants is evaluated on the same 600 cases, which enables paired component comparisons across static prompting, manual iteration, bandit-only routing, mutation-only adaptation, human-only feedback, auto-judge-only feedback, and full dual-loop HOPM. Relative to the static control, full HOPM improves count win rate from 34.7% to 45.7% (+11.0 pp; paired McNemar p = 1.31e-11) and amount-weighted win rate from 22.3% to 41.4% (+19.1 pp; 95% paired bootstrap CI [10.3, 28.9] pp). It also raises mean Likert quality from 3.18 to 4.40 and reduces the issue-flag rate from 15.3% to 5.2%. Supporting review artifacts cover 770 generated-text reviews, 318 labeled reviewer exports, a 10-case/61-rating calibration slice, and a 70-case/350-rating OCR benchmark; these artifacts calibrate the interpretation of the rubric, guardrails, title risk, and OCR risk, and do not substitute for the production ablation. The paper reports the control setup, sample sizes, confidence intervals, paired tests, prompt-token categories, pseudocode, schema, rubric, guardrail taxonomy, and a constructed example, so that the evaluation structure can be reproduced without exposing proprietary evidence.
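The paired statistics quoted above (a McNemar test on count win rate and a paired bootstrap interval on amount-weighted win rate) can be illustrated with a minimal Python sketch. It is not the authors' evaluation code. It assumes an exact two-sided binomial form of the McNemar test and case-level resampling for the bootstrap, neither of which the abstract specifies, and every function and argument name (`mcnemar_exact`, `amount_weighted_win_rate`, `paired_bootstrap_ci`, `control_wins`, `variant_wins`, `amounts`) is hypothetical.

```python
import math
import random


def mcnemar_exact(control_wins, variant_wins):
    """Two-sided exact McNemar p-value for paired binary outcomes.

    Assumption: exact binomial form; the source does not say which
    variant of the test was used.
    """
    pairs = list(zip(control_wins, variant_wins))
    # Discordant pairs: cases where exactly one arm won.
    b = sum(1 for c, v in pairs if c and not v)
    c_only = sum(1 for c, v in pairs if v and not c)
    n = b + c_only
    if n == 0:
        return 1.0
    k = min(b, c_only)
    # Lower binomial tail with success probability 0.5, doubled.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def amount_weighted_win_rate(wins, amounts):
    """Share of the total (positive) case amount that sits in won cases."""
    return sum(a for w, a in zip(wins, amounts) if w) / sum(amounts)


def paired_bootstrap_ci(control_wins, variant_wins, amounts,
                        n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for the variant-minus-control amount-weighted win rate.

    Cases are resampled with replacement, and both arms are scored on
    the same resampled cases, which keeps the comparison paired.
    """
    rng = random.Random(seed)
    n = len(amounts)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        amt = [amounts[i] for i in idx]
        variant = amount_weighted_win_rate([variant_wins[i] for i in idx], amt)
        control = amount_weighted_win_rate([control_wins[i] for i in idx], amt)
        diffs.append(variant - control)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper
```

Because each variant is scored on the same 600 cases, only the discordant cases inform the McNemar test, and resampling whole cases preserves the pairing between arms in the bootstrap.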