Existing methods for creating source-grounded information-seeking dialog datasets are often costly and hard to implement because they rely solely on human annotators. We propose combining large language model (LLM) prompting with human expertise for more efficient and reliable data generation. Instead of the labor-intensive Wizard-of-Oz (WOZ) method, in which two annotators generate a dialog from scratch while role-playing the agent and the user, we use LLM generation to simulate the two roles. Annotators then verify the output and augment it with attribution data. We demonstrate our method by constructing MISeD (Meeting Information Seeking Dialogs dataset), the first information-seeking dialog dataset focused on meeting transcripts. Models finetuned with MISeD demonstrate superior performance on our test set, as well as on a novel fully-manual WOZ test set and an existing query-based summarization benchmark, suggesting the utility of our approach.