Large Language Models (LLMs) are increasingly embedded in applications, and people can shape model behavior by editing prompt instructions. Yet encoding subtle, domain-specific policies into prompts is challenging. Although this process often benefits from concrete test cases, test data and prompt instructions are typically developed as separate artifacts, reflecting traditional machine learning practices in which model tuning was slow and test sets were static. We argue that the fast, iterative nature of prompt engineering calls for removing this separation and enabling a new workflow: data-prompt co-evolution, where a living test set and prompt instructions evolve in tandem. We present an interactive system that operationalizes this workflow. It guides application developers to discover edge cases, articulate rationales for desired behavior, and iteratively evaluate revised prompts against a growing test set. A user study shows our workflow helps people refine prompts systematically, better aligning them with their intended policies. This work points toward more robust and responsible LLM applications through human-in-the-loop development.