Text-to-audio (TTA) generation is advancing rapidly, but evaluation remains challenging because human listening studies are expensive and existing automatic metrics capture only limited aspects of perceptual quality. We introduce AudioEval, a large-scale TTA evaluation dataset with 4,200 generated audio samples (11.7 hours) from 24 systems and 126,000 ratings collected from both experts and non-experts across five dimensions: enjoyment, usefulness, complexity, quality, and text alignment. Using AudioEval, we benchmark diverse automatic evaluators to compare perspective- and dimension-level differences across model families. We also propose Qwen-DisQA as a reference baseline: it jointly processes text prompts and generated audio to predict multi-dimensional ratings for both annotator groups, models rater disagreement via distributional prediction, and achieves strong performance. We will release AudioEval to support future research in TTA evaluation.
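The abstract does not detail how distributional prediction of ratings works, so the following is a minimal sketch of the general idea, not the paper's actual implementation. It assumes a 5-point rating scale (the bin choice is an assumption): individual rater scores are turned into an empirical histogram, a model's predicted distribution can be trained against that histogram with a KL-divergence loss, and the distribution collapses to a scalar mean opinion score when needed.

```python
import math

RATING_BINS = [1, 2, 3, 4, 5]  # assumed 5-point opinion-score scale

def empirical_distribution(rater_scores):
    """Turn individual rater scores into a normalized histogram over bins."""
    counts = [rater_scores.count(b) for b in RATING_BINS]
    total = sum(counts)
    return [c / total for c in counts]

def expected_rating(dist):
    """Collapse a rating distribution to a scalar mean opinion score."""
    return sum(b * p for b, p in zip(RATING_BINS, dist))

def kl_divergence(p, q, eps=1e-8):
    """KL(p || q): a loss pushing a predicted distribution q toward
    the empirical rater distribution p (hypothetical training target)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Example: five raters disagree on one generated audio clip.
scores = [3, 4, 4, 5, 4]
target = empirical_distribution(scores)   # -> [0.0, 0.0, 0.2, 0.6, 0.2]
print(expected_rating(target))            # -> 4.0
```

Predicting the full histogram rather than only its mean preserves the disagreement signal: two clips with the same average score but different rater spread yield different targets.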