Synthetic data is increasingly critical for contact centers, where privacy constraints and data scarcity limit the availability of real conversations. However, generating synthetic dialogues that are both realistic and useful for downstream applications remains challenging. In this work, we benchmark several generation strategies guided by structured supervision on call attributes (Intent Summaries, Topic Flows, and Quality Assurance (QA) Forms) across multiple languages. To test downstream utility, we evaluate synthetic transcripts on an automated quality assurance (AutoQA) task, finding that prompts optimized on real transcripts consistently outperform those optimized on synthetic ones. These results suggest that current synthetic transcripts fall short of capturing the full realism of real agent-customer interactions. To pinpoint these downstream gaps, we introduce a diagnostic evaluation framework comprising 17 metrics across four dimensions: (1) Emotional and Sentiment Arcs, (2) Linguistic Complexity, (3) Interaction Style, and (4) Conversational Properties. Our analysis shows that even with structured supervision, current generation strategies exhibit measurable deficiencies in sentiment fidelity, disfluency modeling, behavioral variation, and conversational realism. Together, these results underscore the importance of diagnostic, metric-driven evaluation for synthetic conversation generation intended for downstream applications.