Grounding conversations in existing passages, known as Retrieval-Augmented Generation (RAG), is an important aspect of chat-based assistants powered by Large Language Models (LLMs), helping to ensure they remain faithful to source material and do not provide misinformation. Several benchmarks have been created to measure LLM performance on this task. We present a longitudinal study comparing the feedback loops of internal and external human annotator groups on the complex annotation task of creating multi-turn RAG conversations for evaluating LLMs. We analyze the conversations produced by both groups and report the results of a survey comparing their experiences. Our study highlights the advantages of each annotator population and the impact of the different feedback loops: a closer feedback loop yields higher-quality conversations at the cost of quantity and diversity. Further, we present guidance on how to best utilize two different annotator populations for annotation tasks, particularly when the task is complex.