While large language models (LLMs) have exhibited impressive conversational capabilities, their proficiency in delivering personalized responses remains unclear. Although recent benchmarks automatically evaluate persona consistency in role-playing contexts using LLM-based judgment, the evaluation of personalization in response generation remains underexplored. To address this gap, we present PersoBench, an automated benchmarking pipeline for evaluating the personalization ability of LLMs in persona-aware dialogue generation in a zero-shot setting. Our framework employs a structured pipeline comprising speaker-aware annotation, task-specific and context-driven prompt construction, response post-processing, and automated evaluation across multiple dimensions of generation quality. Concretely, the pipeline performs text preprocessing and speaker labeling, constructs structured prompts with task instructions and LLM roles, validates response format, and evaluates valid outputs for fluency, personalization, diversity, and coherence. We assess the performance of four open-source and four closed-source LLMs on well-known datasets using a range of explicit metrics. Our findings reveal that while LLMs excel at generating fluent and diverse responses, they fall short of delivering personalized and coherent responses that account for both the conversation context and the provided personas.
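To make the pipeline stages concrete, the sketch below outlines one possible realization of the four steps described above (speaker labeling, prompt construction, response validation, and multi-dimensional evaluation). All function and field names here are hypothetical illustrations, not the actual PersoBench implementation or API, and the evaluation step is a placeholder for whichever explicit metrics are used.

```python
# Minimal, illustrative sketch of the benchmarking stages; names such as
# annotate_speakers, build_prompt, validate_response, and evaluate are
# hypothetical and do not correspond to the real PersoBench code.

from dataclasses import dataclass


@dataclass
class Example:
    persona: list[str]   # persona sentences for the target speaker
    context: list[str]   # raw dialogue turns, oldest first
    reference: str       # gold response, used only by the evaluation step


def annotate_speakers(context: list[str]) -> list[str]:
    """Speaker-aware annotation: attach alternating speaker labels to turns."""
    roles = ["User", "Assistant"]
    return [f"{roles[i % 2]}: {turn.strip()}" for i, turn in enumerate(context)]


def build_prompt(persona: list[str], labeled_context: list[str]) -> str:
    """Task-specific, context-driven prompt with an explicit LLM role."""
    persona_block = "\n".join(f"- {p}" for p in persona)
    history_block = "\n".join(labeled_context)
    return (
        "You are the Assistant in the dialogue below. "
        "Reply with a single response that is consistent with your persona.\n"
        f"Persona:\n{persona_block}\n"
        f"Dialogue:\n{history_block}\n"
        "Assistant:"
    )


def validate_response(raw: str) -> str | None:
    """Post-processing: keep only well-formed, single-turn responses."""
    text = raw.strip().removeprefix("Assistant:").strip()
    return text if text and "\n" not in text else None


def evaluate(response: str, example: Example) -> dict[str, float]:
    """Placeholder for fluency, personalization, diversity, and coherence
    scores; real metrics would be computed here from response and reference."""
    return {"fluency": 0.0, "personalization": 0.0,
            "diversity": 0.0, "coherence": 0.0}
```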