Large Language Models (LLMs) are increasingly embedded in applications, and people can shape model behavior by editing prompt instructions. Yet encoding subtle, domain-specific policies into prompts is challenging. Although this process often benefits from concrete test cases, test data and prompt instructions are typically developed as separate artifacts, reflecting traditional machine learning practices in which model tuning was slow and test sets were static. We argue that the fast, iterative nature of prompt engineering calls for removing this separation and enabling a new workflow: data-prompt co-evolution, where a living test set and prompt instructions evolve in tandem. We present an interactive system that operationalizes this workflow. It guides application developers to discover edge cases, articulate rationales for desired behavior, and iteratively evaluate revised prompts against a growing test set. A user study shows our workflow helps people refine prompts systematically, better aligning them with their intended policies. This work points toward more robust and responsible LLM applications through human-in-the-loop development.