When Generic Prompt Improvements Hurt: Evaluation-Driven Iteration for LLM Applications

Evaluating Large Language Model (LLM) applications differs from conventional software testing because outputs are probabilistic, semantically variable, and sensitive to prompt and model changes. This technical report proposes the Minimum Viable Evaluation Suite (MVES), an audit-oriented structure for application-level LLM evaluation. MVES links application categories to failure modes, metrics, required artifacts, and validation evidence across general LLM applications, retrieval-augmented systems, and agentic workflows. We pair the framework with a reproducible local evaluation harness covering structured extraction, RAG citation/content-compliance, and instruction-following checks. Using Ollama with Llama 3 8B Instruct and Qwen 2.5 7B Instruct, we evaluate five prompt conditions over expanded 30-case-per-suite ablations. The results show that, in the tested local conditions, generic prompt additions do not produce monotonic improvements: stronger output-contract prompts improve strict extraction for both models, while RAG citation/content-compliance declines under some generic-rule conditions. The largest observed decline occurs for Qwen 2.5 on RAG when generic rules are appended to the user prompt, from 26/30 to 9/30. These findings support evaluation-driven prompt iteration: prompt changes should be treated as potential regression risks and tested against task-specific suites before deployment. The accompanying repository contains the test suites, prompt variants, evaluation harness, raw result logs, and scripts needed to reproduce the reported local ablations.
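The idea of treating a prompt change as a regression risk can be sketched as a simple gate that compares per-suite pass counts before and after the change. This is a minimal illustration, not the repository's actual harness: `SuiteResult`, `find_regressions`, the suite name `rag_citation`, and the `max_drop` threshold are all hypothetical names and values. The only figures taken from the report are the Qwen 2.5 RAG counts (26/30 and 9/30).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuiteResult:
    """Pass count for one evaluation suite under one prompt condition (hypothetical type)."""
    suite: str
    passed: int
    total: int

    @property
    def pass_rate(self) -> float:
        return self.passed / self.total


def find_regressions(baseline, candidate, max_drop=0.05):
    """Return (suite, baseline_rate, candidate_rate) for every suite whose
    pass rate fell by more than max_drop (an assumed, tunable threshold)."""
    base_by_suite = {r.suite: r for r in baseline}
    regressions = []
    for cand in candidate:
        base = base_by_suite.get(cand.suite)
        if base is None:
            continue  # no baseline for this suite, so nothing to compare
        if base.pass_rate - cand.pass_rate > max_drop:
            regressions.append((cand.suite, base.pass_rate, cand.pass_rate))
    return regressions


# Qwen 2.5 RAG counts reported in the abstract: baseline prompt vs. the
# condition with generic rules appended to the user prompt.
baseline = [SuiteResult("rag_citation", 26, 30)]
candidate = [SuiteResult("rag_citation", 9, 30)]

flagged = find_regressions(baseline, candidate)
for suite, before, after in flagged:
    print(f"REGRESSION {suite}: {before:.0%} -> {after:.0%}")
```

In a deployment workflow, a non-empty `flagged` list would block the prompt change until the affected suite is investigated. With only 30 cases per suite, a small threshold will also trip on ordinary run-to-run noise, so the value chosen here is illustrative only.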
