Many Natural Language Generation (NLG) tasks aim to generate a single output text given an input prompt. Other settings, such as Synthetic Traffic Generation (STG), require generating multiple texts. STG is crucial for training and evaluating QA systems and conversational agents, where the goal is to generate multiple questions or utterances that reflect the linguistic variability of real users. In this paper, we show that common NLG metrics, like BLEU, are not suitable for evaluating STG. We propose and evaluate several metrics designed to compare the generated traffic to the distribution of real user texts. We validate our metrics with an automatic procedure that checks whether they capture different types of quality issues in generated data; we also collect human annotations to measure their correlation with human judgements. Experiments on three tasks, i.e., Shopping Utterance Generation, Product Question Generation and Query Auto Completion, demonstrate that our metrics are effective for evaluating STG tasks, and improve agreement with human judgement by up to 20% with respect to common NLG metrics. We believe these findings can pave the way towards better solutions for estimating the representativeness of synthetic text data.