We address text-assisted speech intelligibility prediction for hearing-impaired listeners in CPC3. Although the target is a sentence-level percentage, it is determined by reference-word recognition outcomes. We therefore formulate prediction as reference-conditioned word-level correctness modeling: a frozen Whisper encoder analyzes the degraded speech, a teacher-forced decoder conditions on the canonical transcript, and sentence intelligibility is obtained by averaging the predicted correctness probabilities over valid reference words. To complement the transcript-conditioned decoder states, we add a word-aligned local acoustic branch based on character-level cross-attention alignment and an utterance-level global acoustic branch for calibration. On the official evaluation set, the decoder baseline obtains RMSE 24.92 and correlation 0.795, while joint fusion improves these to RMSE 24.39 and correlation 0.806 and reaches an incorrect-word F1 of 0.778 and MCC of 0.626. A similar trend with Whisper medium suggests that the gain comes from prediction granularity and alignment-aware fusion.
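The aggregation step described in the abstract, averaging predicted correctness probabilities over valid reference words to obtain a sentence-level percentage, can be illustrated with a minimal sketch. The function name `sentence_intelligibility`, the boolean `valid_mask`, and the example probabilities are assumptions made for illustration; the abstract does not specify how valid words are marked or how the model's outputs are stored.

```python
from typing import Sequence


def sentence_intelligibility(
    word_probs: Sequence[float],
    valid_mask: Sequence[bool],
) -> float:
    """Mean predicted correctness over valid reference words, as a percentage.

    Hypothetical helper: `word_probs[i]` is the predicted probability that
    reference word i was recognized correctly, and `valid_mask[i]` marks
    whether that position counts toward the sentence score.
    """
    if len(word_probs) != len(valid_mask):
        raise ValueError("word_probs and valid_mask must have the same length")
    # Keep only the probabilities at valid reference-word positions.
    kept = [p for p, keep in zip(word_probs, valid_mask) if keep]
    if not kept:
        raise ValueError("no valid reference words to average over")
    # Scale the mean probability to a 0-100 sentence-level percentage.
    return 100.0 * sum(kept) / len(kept)


# Made-up probabilities for a five-position reference transcript;
# the last position is treated as invalid (for example, padding).
probs = [0.9, 0.8, 0.1, 0.6, 0.5]
mask = [True, True, True, True, False]
score = sentence_intelligibility(probs, mask)
```

With these made-up values the score is about 60, since only the first four probabilities enter the average. Under this formulation the sentence-level target needs no separate regression head: it follows directly from the word-level predictions.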