Language models have steadily increased in size over the past few years and now achieve high performance on a variety of natural language processing (NLP) tasks such as question answering and summarization. Large language models (LLMs) have been used for generation and can now output human-like text. As a result, other downstream tasks in the realm of dialog can now harness the language understanding capabilities of LLMs. Dialog evaluation is one such task, and it is the focus of this paper. The paper concentrates on prompting with the following LLMs: BLOOM, OPT, GPT-3, Flan-T5, InstructDial, and TNLGv2. It shows that the choice of datasets used to train a model affects both how well the model performs on a task and how the prompt should be structured. Specifically, the more diverse and relevant the group of datasets a model is trained on, the better the model performs at dialog evaluation. The paper also investigates how the number of examples in the prompt and the type of example selection affect the model's performance.