Non-intrusive intelligibility prediction estimates how well hearing-impaired listeners understand hearing-aid-processed speech without a clean reference. We study this task in the 3rd Clarity Prediction Challenge using two frozen speech encoders, Canary and WavLM. The central question is not only whether complementary pretrained representations should be combined, but also where their interaction should occur. We compare single-backbone baselines, uniform score averaging, pool-late fusion, cross-attention, frame-aligned fusion, and reverse alignment under a shared binaural framework that preserves the left and right channels. Among the compared systems, the best model first downsamples the WavLM features in time with a learnable strided convolution and then fuses them with the Canary features on the coarser Canary timeline before pooling, reaching an Eval RMSE of 24.96$\pm$0.06 and an Eval Corr of 0.796$\pm$0.001. Severity, enhancement-system, layer-window, and temporal-shift analyses indicate that coarse local temporal correspondence before pooling is a useful inductive bias for this task.
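The frame-aligned fusion idea described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. Everything specific in it is an assumption made for illustration: the 4:1 frame-rate ratio between the two encoders, the feature dimensions, the use of concatenation as the fusion operator, mean pooling over time, and the helper names `strided_conv1d` and `fuse_before_pooling`. Random arrays stand in for the frozen encoder outputs.

```python
import numpy as np

def strided_conv1d(x, weight, bias, stride):
    """Valid 1-D convolution along time.

    x: (T, D_in) frame sequence; weight: (K, D_in, D_out); bias: (D_out,).
    Returns an array of shape ((T - K) // stride + 1, D_out).
    """
    kernel = weight.shape[0]
    n_out = (x.shape[0] - kernel) // stride + 1
    out = np.empty((n_out, weight.shape[2]))
    for t in range(n_out):
        window = x[t * stride : t * stride + kernel]  # (K, D_in)
        out[t] = np.einsum("kd,kdo->o", window, weight) + bias
    return out

def fuse_before_pooling(fine, coarse, weight, bias, stride):
    """Bring the fine-rate stream onto the coarse timeline, fuse per frame, then pool."""
    fine_on_coarse = strided_conv1d(fine, weight, bias, stride)
    n = min(len(fine_on_coarse), len(coarse))  # guard against off-by-one lengths
    # Assumption: fusion is plain per-frame concatenation.
    fused = np.concatenate([fine_on_coarse[:n], coarse[:n]], axis=1)
    # Assumption: temporal pooling is a simple mean over frames.
    return fused.mean(axis=0)

rng = np.random.default_rng(0)
stride = kernel = 4                   # assumed 4:1 frame-rate ratio
d_fine, d_coarse, d_proj = 8, 6, 6    # toy feature sizes, not the real encoder widths

# In a trained model these would be learned parameters; here they are random.
weight = rng.normal(size=(kernel, d_fine, d_proj)) * 0.1
bias = np.zeros(d_proj)

# Process each ear separately so left/right information is preserved.
ears = {}
for ear in ("left", "right"):
    wavlm_like = rng.normal(size=(40, d_fine))     # stand-in for WavLM frames
    canary_like = rng.normal(size=(10, d_coarse))  # stand-in for Canary frames
    ears[ear] = fuse_before_pooling(wavlm_like, canary_like, weight, bias, stride)

binaural = np.concatenate([ears["left"], ears["right"]])
print(binaural.shape)  # (24,)
```

The point of the sketch is the ordering: the two streams are paired frame by frame on the coarse timeline before any pooling, so local temporal correspondence is still available to the fusion step. A pool-late variant would instead average each stream over time first and combine only the two utterance-level vectors.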