Humans follow criteria when they execute tasks, and the same criteria are used directly to assess the quality of the result. Teaching models to use criteria when giving feedback can therefore help humans or models perform tasks better. However, existing research in this field tends to consider only a limited set of criteria or quality assessment aspects. To fill this gap, we propose a general framework that enables large language models (LLMs) to apply a comprehensive set of task-specific criteria when delivering natural language feedback on task execution. In particular, we present a model-in-the-loop framework that semi-automatically derives criteria from collected guidelines for different writing tasks and constructs in-context demonstrations for each criterion. To operationalize this idea, we choose three tasks from real-world scenarios (paper introduction writing, Python code writing, and Reddit post writing) and evaluate our feedback generation framework with different LLMs. The results reveal the fine-grained effects of incorporating criteria and demonstrations and provide insights into how to teach LLMs to use criteria more effectively.