Large Language Model (LLM) judges exhibit strong reasoning capabilities but are limited to textual content. This leaves current automatic Speech-to-Speech (S2S) evaluation methods reliant on opaque and expensive Audio Language Models (ALMs). In this work, we propose TRACE (Textual Reasoning over Audio Cues for Evaluation), a novel framework that enables LLM judges to reason over audio cues to achieve cost-efficient and human-aligned S2S evaluation. To demonstrate the strength of the framework, we first introduce a Human Chain-of-Thought (HCoT) annotation protocol to improve the diagnostic capability of existing judge benchmarks by separating evaluation into explicit dimensions: content (C), voice quality (VQ), and paralinguistics (P). Using this data, TRACE constructs a textual blueprint of inexpensive audio signals and prompts an LLM to render dimension-wise judgments, fusing them into an overall rating via a deterministic policy. TRACE achieves higher agreement with human raters than ALMs and transcript-only LLM judges while being significantly more cost-effective. We will release the HCoT annotations and the TRACE framework to enable scalable and human-aligned S2S evaluation.
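The deterministic fusion step described above can be sketched in a few lines. The abstract does not specify the actual policy, so the rule below (averaging the three dimension scores, capped just above the weakest dimension) is a hypothetical illustration, not TRACE's implementation:

```python
# Illustrative sketch of TRACE-style fusion of dimension-wise judgments.
# The min-gated average below is an assumed policy for illustration only;
# the paper's actual deterministic fusion rule is not given here.

def fuse_scores(content: float, voice_quality: float, paralinguistics: float) -> float:
    """Fuse the three dimension-wise judgments (C, VQ, P) into one
    overall rating, on a shared scale (e.g., 1-5)."""
    avg = (content + voice_quality + paralinguistics) / 3
    weakest = min(content, voice_quality, paralinguistics)
    # Deterministic gating: the overall rating cannot exceed the
    # weakest dimension by more than one point.
    return min(avg, weakest + 1.0)
```

A deterministic policy of this kind keeps the overall score reproducible and auditable, in contrast to asking the LLM to produce a single holistic number.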