Autonomous conversational agents, i.e. chatbots, are becoming an increasingly common mechanism for enterprises to provide support to customers and partners. To rate chatbots, especially those powered by Generative AI tools such as Large Language Models (LLMs), we need to be able to assess their performance accurately. This is where chatbot benchmarking becomes important. In this paper, we propose a novel benchmark that we call the E2E (End to End) benchmark, and show how it can be used to evaluate the accuracy and usefulness of the answers provided by chatbots, especially those powered by LLMs. We evaluate an example chatbot at different levels of sophistication using both our E2E benchmark and other metrics commonly used in the state of the art, and observe that the proposed benchmark shows better results than the others. In addition, while some metrics proved to be unpredictable, the metric associated with the E2E benchmark, which uses cosine similarity, performed well in evaluating chatbots. The performance of our best models shows that there are several benefits to using the cosine similarity score as a metric in the E2E benchmark.
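As a rough illustration of the kind of metric referred to above, the following minimal sketch computes the cosine similarity between two embedding vectors and averages it over a set of answer pairs. It assumes that the chatbot's answer and a reference answer have already been converted to fixed-length numeric vectors by some embedding model; the function names `cosine_similarity` and `mean_similarity` and the toy vectors are hypothetical and are not taken from the paper.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: treat an all-zero embedding as having no similarity.
        return 0.0
    return dot / (norm_a * norm_b)


def mean_similarity(pairs: Sequence[tuple]) -> float:
    """Average cosine similarity over (chatbot, reference) embedding pairs."""
    if not pairs:
        raise ValueError("at least one pair is required")
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)


# Toy embeddings standing in for real model output (hypothetical values).
chatbot_answer = [1.0, 2.0, 3.0]
reference_answer = [2.0, 4.0, 6.0]
score = cosine_similarity(chatbot_answer, reference_answer)
```

Because cosine similarity depends only on the direction of the vectors, the two toy embeddings above, which differ only in scale, score as close to 1 as floating-point arithmetic allows.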