While speech quality is typically assessed on complete utterances, streaming and generative systems require incremental estimation from partial audio. Existing predictors assume full context and degrade on prefix-constrained inputs. Extending ARECHO, we propose ANCHOR, which reformulates incremental assessment as a multi-resolution autoregressive task. ANCHOR models chunk- and utterance-level quality within a single decoder, using dual-resolution tokens and a resolution-aware hierarchy for coarse-to-fine refinement. Experiments show substantial robustness under partial input, including a 48% PLCMOS error reduction on 2-second prefixes. Convergence analysis reveals an effective perceptual context horizon of 4-6 s. A stress test further isolates structured extrapolation biases under localized corruption. Results demonstrate that hierarchical supervision improves incremental prediction and elucidates how perceptual quality accumulates over time.