Although text-to-audio generation has made remarkable progress in realism and diversity, the development of evaluation metrics has not kept pace. Widely adopted approaches, typically based on embedding similarity such as CLAPScore, effectively measure general relevance but remain limited in fine-grained semantic alignment and compositional reasoning. To address this, we introduce AQAScore, a backbone-agnostic evaluation framework that leverages the reasoning capabilities of audio-aware large language models (ALLMs). AQAScore reformulates assessment as a probabilistic semantic verification task; rather than relying on open-ended text generation, it estimates alignment by computing the exact log-probability of a "Yes" answer to targeted semantic queries. We evaluate AQAScore across multiple benchmarks, including human-rated relevance, pairwise comparison, and compositional reasoning tasks. Experimental results show that AQAScore consistently achieves higher correlation with human judgments than similarity-based metrics and generative prompting baselines, demonstrating its effectiveness in capturing subtle semantic inconsistencies and its ability to scale with the capability of the underlying ALLM.
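To make the scoring idea concrete, below is a minimal sketch of the "Yes"-probability computation, assuming a generic Hugging Face causal language model as a text-only stand-in for an ALLM; a real audio-aware backbone would also consume the audio clip. The model name, prompt wording, and the yes_logprob helper are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of scoring via log P("Yes" | query), assuming a text-only
# stand-in model; an actual ALLM would condition on the audio as well.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; an ALLM backbone would be used in practice

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def yes_logprob(question: str) -> float:
    """Return log P("Yes" | question) under the language model.

    The question encodes a targeted semantic query about the audio, e.g.
    "Does the audio contain a dog barking followed by thunder? Answer Yes or No."
    """
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids
    yes_ids = tokenizer(" Yes", add_special_tokens=False).input_ids

    input_ids = torch.cat([prompt_ids, torch.tensor([yes_ids])], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits  # (1, seq_len, vocab)

    # Sum log-probabilities of the "Yes" tokens given the preceding context.
    log_probs = torch.log_softmax(logits, dim=-1)
    total = 0.0
    for i, tok in enumerate(yes_ids):
        pos = prompt_ids.shape[1] + i - 1  # logits at pos predict token pos+1
        total += log_probs[0, pos, tok].item()
    return total

score = yes_logprob(
    "Does the audio contain a dog barking followed by thunder? Answer Yes or No."
)
print(f"log P(Yes) = {score:.3f}")
```

In this framing, a higher log-probability of "Yes" indicates stronger verified alignment between the audio and the queried semantic content, without requiring the model to produce and parse free-form text.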