Automatic pronunciation assessment is a major component of computer-assisted pronunciation training systems. To provide in-depth feedback, it is essential to score pronunciation at multiple levels of granularity (phoneme, word, and utterance) and across diverse aspects such as accuracy, fluency, and completeness. However, existing multi-aspect, multi-granularity methods predict all aspects at all granularity levels simultaneously, which makes it difficult for them to capture the linguistic hierarchy of phoneme, word, and utterance. This limitation in turn leads them to neglect the close relations among aspects of the same linguistic unit. In this paper, we propose a Hierarchical Pronunciation Assessment with Multi-aspect Attention (HiPAMA) model, which represents the granularity levels hierarchically to capture their linguistic structure directly and introduces multi-aspect attention that reflects associations across aspects at the same level to create more expressive representations. By obtaining relational information from both the granularity side and the aspect side, HiPAMA can take full advantage of multi-task learning. Experimental results on the speechocean762 dataset show remarkable improvements and demonstrate the robustness of HiPAMA, particularly on the difficult-to-assess aspects.
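The two ideas named in the abstract, hierarchical representation of granularity levels and attention across aspects at the same level, can be illustrated with a minimal sketch. This is not the HiPAMA implementation: the function names, the use of mean pooling to move from phonemes to words to the utterance, and the parameter-free scaled dot-product attention with a residual connection are all assumptions chosen to keep the example self-contained, and the feature values are toy numbers.

```python
import math


def mean_pool(vectors):
    """Average a list of equal-length vectors element-wise."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def hierarchical_features(phone_feats, word_spans):
    """Build word- and utterance-level features bottom-up.

    phone_feats: one feature vector per phoneme.
    word_spans:  (start, end) phoneme index pairs, end exclusive, one per word.
    Assumption: mean pooling stands in for whatever learned layers a real
    model would use between levels.
    """
    word_feats = [mean_pool(phone_feats[start:end]) for start, end in word_spans]
    utt_feat = mean_pool(word_feats)
    return word_feats, utt_feat


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multi_aspect_attention(aspect_vecs):
    """Let each aspect vector attend to every aspect vector at the same level.

    aspect_vecs: dict mapping an aspect name to its vector.
    Assumption: queries, keys, and values are the raw vectors (no learned
    projections); the attended mixture is added back as a residual.
    """
    names = list(aspect_vecs)
    dim = len(aspect_vecs[names[0]])
    attended = {}
    for query in names:
        scores = [
            sum(a * b for a, b in zip(aspect_vecs[query], aspect_vecs[key]))
            / math.sqrt(dim)
            for key in names
        ]
        weights = softmax(scores)
        mixed = [
            sum(w * aspect_vecs[key][i] for w, key in zip(weights, names))
            for i in range(dim)
        ]
        attended[query] = [x + m for x, m in zip(aspect_vecs[query], mixed)]
    return attended


# Toy usage: three phonemes forming two words (hypothetical values).
phone_feats = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
word_feats, utt_feat = hierarchical_features(phone_feats, [(0, 2), (2, 3)])

# Hypothetical utterance-level aspect vectors for the aspects named in the abstract.
utt_aspects = {
    "accuracy": [0.9, 0.1],
    "fluency": [0.4, 0.6],
    "completeness": [0.8, 0.2],
}
utt_aspects_attended = multi_aspect_attention(utt_aspects)
```

In this sketch the utterance feature is derived only from word features, which are in turn derived only from phoneme features, so the hierarchy is explicit rather than left for a flat predictor to infer. The attention step then lets each aspect representation draw on the others at the same level before a score would be predicted from it.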