The promise of LLM-based user simulators for improving conversational AI is hindered by a critical "realism gap": systems become optimized for simulated interactions but may fail to perform well in the real world. We introduce ConvApparel, a new dataset of human-AI conversations designed to address this gap. Its unique dual-agent data collection protocol -- using both "good" and "bad" recommenders -- enables counterfactual validation by capturing a wide spectrum of user experiences, enriched with first-person annotations of user satisfaction. We propose a comprehensive validation framework that combines statistical alignment, a human-likeness score, and counterfactual validation to test for generalization. Our experiments reveal a significant realism gap across all simulators. However, the framework also shows that data-driven simulators outperform a prompted baseline, particularly in counterfactual validation, where they adapt more realistically to unseen behaviors, suggesting that they embody more robust, if imperfect, user models.