Podcast script generation requires LLMs to synthesize structured, context-grounded dialogue from diverse inputs, yet systematic evaluation resources for this task remain limited. To bridge this gap, we introduce PodBench, a benchmark comprising 800 samples with inputs of up to 21K tokens and complex multi-speaker instructions. We propose a multifaceted evaluation framework that integrates quantitative constraint checking with LLM-based quality assessment. Extensive experiments reveal that while proprietary models generally excel, open-source models equipped with explicit reasoning demonstrate superior robustness to standard baselines in handling long contexts and multi-speaker coordination. However, our analysis uncovers a persistent gap: strong instruction following does not guarantee substantive content. PodBench offers a reproducible testbed for addressing these challenges in long-form, audio-centric generation.