We introduce a dynamic benchmarking system for conversational agents that evaluates their performance through a single, simulated, and lengthy user$\leftrightarrow$agent interaction. The interaction is a conversation between the user and the agent, in which multiple tasks are introduced and then undertaken concurrently. We context switch regularly to interleave the tasks, constructing a realistic testing scenario in which we assess the Long-Term Memory (LTM), Continual Learning, and Information Integration capabilities of the agents. Results from both proprietary and open-source Large Language Models (LLMs) show that LLMs generally perform well on single-task interactions but struggle on the same tasks when they are interleaved. Notably, short-context LLMs supplemented with an LTM system perform as well as or better than those with larger contexts. Our benchmark suggests that LLMs face further challenges when responding to more natural interactions, challenges that contemporary benchmarks have so far failed to capture.