This paper describes the systems submitted by team6 for ChatEval, the DSTC 11 Track 4 competition. We present three approaches, all based on large language models (LLMs), to predicting turn-level qualities of chatbot responses. Prompting ChatGPT with dynamic few-shot examples retrieved from a vector store improves over the baseline. We also analyze the performance of the other two approaches and identify the improvements they need in future work. We developed the three systems in just two weeks, which shows the potential of LLMs for this task. An ablation study conducted after the challenge deadline shows that the new Llama 2 models are closing the performance gap between ChatGPT and open-source LLMs. However, we find that the Llama 2 models do not benefit from few-shot examples in the same way that ChatGPT does.
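To make the idea of dynamic few-shot examples concrete, the following is a minimal sketch of retrieving the most similar annotated turns from a small in-memory store and placing them in a scoring prompt. It is an illustration under stated assumptions, not the submitted system: the bag-of-words `embed` function stands in for a real sentence encoder, and `ExampleStore`, `build_prompt`, the 1 to 5 scale, and the prompt wording are all hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding (assumption: a real system would use a sentence encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ExampleStore:
    """Hypothetical in-memory stand-in for a vector store of annotated turns."""

    def __init__(self):
        self._items = []

    def add(self, context, response, score):
        vector = embed(context + " " + response)
        self._items.append((vector, context, response, score))

    def nearest(self, context, response, k=2):
        """Return the k stored (context, response, score) triples most similar to the query."""
        query = embed(context + " " + response)
        ranked = sorted(self._items, key=lambda item: cosine(query, item[0]), reverse=True)
        return [(c, r, s) for _, c, r, s in ranked[:k]]


def build_prompt(store, context, response, k=2):
    """Assemble a few-shot prompt whose examples depend on the turn being scored."""
    lines = ["Rate the quality of the response on a scale from 1 to 5.", ""]
    for c, r, s in store.nearest(context, response, k):
        lines += [f"Context: {c}", f"Response: {r}", f"Score: {s}", ""]
    lines += [f"Context: {context}", f"Response: {response}", "Score:"]
    return "\n".join(lines)
```

Because the examples are selected per query, each turn is scored against the annotated turns that most resemble it rather than against a fixed set. The resulting prompt string would then be sent to the LLM, a step omitted here.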