Speech signal analysis poses significant challenges, particularly in tasks such as speech quality evaluation and profiling, where the goal is to predict multiple perceptual and objective metrics. For instance, PESQ (Perceptual Evaluation of Speech Quality), STOI (Short-Time Objective Intelligibility), and MOS (Mean Opinion Score) each capture a different aspect of speech quality. These metrics often differ in scale, assumptions, and dependencies, which makes joint estimation non-trivial. To address these issues, we introduce ARECHO (Autoregressive Evaluation via Chain-based Hypothesis Optimization), a versatile, chain-based evaluation system for speech assessment grounded in autoregressive dependency modeling. ARECHO is distinguished by three key innovations: (1) a comprehensive speech information tokenization pipeline; (2) a dynamic classifier chain that explicitly captures inter-metric dependencies; and (3) a two-step confidence-oriented decoding algorithm that enhances inference reliability. Experiments demonstrate that ARECHO significantly outperforms the baseline framework across diverse evaluation scenarios, including enhanced speech analysis, speech generation evaluation, and noisy speech evaluation. Furthermore, its dynamic dependency modeling improves interpretability by capturing inter-metric relationships. Across tasks, ARECHO offers reference-free evaluation, uses its dynamic classifier chain to support subset queries (single or multiple metrics), and reduces error propagation through confidence-oriented decoding.
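The abstract names a dynamic classifier chain and confidence-oriented decoding but does not spell out either procedure. The following is a minimal sketch of one plausible reading, not ARECHO's actual algorithm: at each step every still-undecoded metric is proposed with a confidence, only the most confident proposal is committed, and later proposals are conditioned on the committed values. The function `confidence_ordered_decode`, the `Predictor` signature, and `toy_predict` with its numbers are all hypothetical and exist only for illustration.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical interface: given a metric name and the metrics decoded so far,
# return a (predicted value, confidence in [0, 1]) pair.
Predictor = Callable[[str, Dict[str, float]], Tuple[float, float]]


def confidence_ordered_decode(
    metrics: Iterable[str], predict: Predictor
) -> Tuple[Dict[str, float], List[str]]:
    """Decode the requested metrics one at a time, most confident first."""
    decoded: Dict[str, float] = {}
    order: List[str] = []
    remaining = list(metrics)
    while remaining:
        # Step 1: propose a value and a confidence for every undecoded metric,
        # conditioned on what has already been committed.
        proposals = {m: predict(m, decoded) for m in remaining}
        # Step 2: commit only the most confident proposal, so that uncertain
        # predictions are made late and have fewer chances to propagate errors.
        best = max(remaining, key=lambda m: proposals[m][1])
        decoded[best] = proposals[best][0]
        order.append(best)
        remaining.remove(best)
    return decoded, order


def toy_predict(metric: str, context: Dict[str, float]) -> Tuple[float, float]:
    """Stand-in for a learned model; all values here are made up."""
    base = {"PESQ": (2.5, 0.9), "STOI": (0.8, 0.6), "MOS": (2.0, 0.3)}
    value, confidence = base[metric]
    # Assumption: every committed metric makes the remaining ones easier.
    confidence = min(1.0, confidence + 0.2 * len(context))
    # Assumption: the MOS estimate is refined once PESQ is known.
    if metric == "MOS" and "PESQ" in context:
        value = context["PESQ"] + 0.5
    return value, confidence


if __name__ == "__main__":
    # Full query: the decoding order is chosen dynamically, not fixed up front.
    print(confidence_ordered_decode(["MOS", "STOI", "PESQ"], toy_predict))
    # Subset query: only one metric is requested.
    print(confidence_ordered_decode(["MOS"], toy_predict))
```

With the toy predictor, the full query commits PESQ first, then STOI, then MOS, and the MOS estimate differs from the single-metric query because it is conditioned on the already decoded PESQ value. In a real system the predictor would be a trained autoregressive model over tokenized speech and metric values.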