The advancement of Large Language Models (LLMs) has led to significant improvements in the performance of chatbot systems. Many researchers have devoted their efforts to giving chatbots distinct characteristics. Although commercial products for building role-driven chatbots with LLMs already exist, academic research in this area remains relatively scarce. Our research investigates the performance of LLMs in constructing Characteristic AI Agents by simulating real-life individuals across different settings. Existing studies have primarily focused on acting out roles with simple profiles. In response to this research gap, we create a benchmark for the characteristic AI agents task, including a dataset, techniques, and evaluation metrics. A dataset called "Character100" is built for this benchmark, comprising the most-visited people on Wikipedia for language models to role-play. With the constructed dataset, we conduct a comprehensive assessment of LLMs across various settings. In addition, we devise a set of automatic metrics for quantitative performance evaluation. The experimental results point to potential directions for further improving the capabilities of LLMs in constructing characteristic AI agents. The benchmark is available at https://github.com/nuaa-nlp/Character100.