Ensuring that large language model (LLM) responses align with prompt instructions is crucial for application development. Our formative study with industry professionals shows that achieving this alignment requires heavy human involvement and tedious trial and error, especially when the prompt contains many instructions. To address these challenges, we introduce CoPrompter, a framework that identifies misalignment by assessing multiple LLM responses against evaluation criteria. It contributes a method for generating evaluation criteria questions directly from prompt requirements and an interface that turns these questions into a user-editable checklist. Our user study with industry prompt engineers shows that, compared with traditional methods, CoPrompter improves their ability to identify and refine instruction alignment with prompt requirements, helps them understand where and how frequently models fail to follow their prompt requirements, and helps them clarify their own requirements, giving them greater control over the response evaluation process. We also present design lessons that underscore our system's potential to streamline the prompt engineering process.
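To make the core idea concrete, the sketch below is a minimal illustration (not the paper's implementation) of evaluating several sampled LLM responses against a checklist of yes/no criteria questions derived from prompt requirements. The criteria, the `Criterion` class, and the sample responses are hypothetical; in CoPrompter each criterion would be graded by an LLM rather than by the simple string checks used here.

```python
# Illustrative sketch only: checking several candidate LLM responses against
# a user-editable checklist of criteria derived from prompt requirements.
# All names and data below are hypothetical stand-ins.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Criterion:
    question: str                 # human-readable yes/no criterion
    check: Callable[[str], bool]  # stand-in for an LLM-based grader


# Hypothetical checklist for a prompt asking for a short, bulleted answer
# that ends with a follow-up question.
checklist: List[Criterion] = [
    Criterion("Is the response under 50 words?",
              lambda r: len(r.split()) < 50),
    Criterion("Is the response formatted as bullet points?",
              lambda r: r.lstrip().startswith("-")),
    Criterion("Does the response end with a question?",
              lambda r: r.rstrip().endswith("?")),
]

# Hypothetical candidate responses sampled from a model.
responses = [
    "- Point one\n- Point two\nWould you like more detail?",
    "Here is a long paragraph that ignores the requested bullet format...",
]

# Report, per criterion, how many sampled responses satisfy it, surfacing
# where and how frequently the model misses a requirement.
for criterion in checklist:
    passed = sum(criterion.check(r) for r in responses)
    print(f"{criterion.question} {passed}/{len(responses)} responses pass")
```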