Automatic pronunciation assessment plays a crucial role in computer-assisted pronunciation training systems. Because they can perform multiple pronunciation assessment tasks simultaneously, multi-aspect, multi-granularity methods have been receiving increasing attention and achieve better performance than single-level modeling. However, existing methods consider only unidirectional dependencies between adjacent granularity levels, lacking bidirectional interaction among the phoneme, word, and utterance levels and thus insufficiently capturing acoustic structural correlations. To address this issue, we propose a novel residual hierarchical interactive method, HIA for short, that enables bidirectional modeling across granularities. As the core of HIA, the Interactive Attention Module leverages an attention mechanism to achieve dynamic bidirectional interaction, effectively capturing linguistic features at each granularity while integrating correlations between different granularity levels. We also propose a residual hierarchical structure to alleviate the feature-forgetting problem when modeling acoustic hierarchies. In addition, we use 1-D convolutional layers to enhance the extraction of local contextual cues at each granularity. Extensive experiments on the speechocean762 dataset show that our model consistently outperforms existing state-of-the-art methods.
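To make the bidirectional-interaction idea concrete, the following is a minimal sketch, not the authors' implementation: it assumes PyTorch, and the class name `InteractiveAttentionBlock`, the dimensions, and the two-level (phoneme/word) setup are illustrative choices. Each granularity attends to the other via cross-attention, residual connections preserve level-specific features, and a 1-D convolution adds local context within each level.

```python
import torch
import torch.nn as nn


class InteractiveAttentionBlock(nn.Module):
    """Illustrative bidirectional interaction between two granularity levels."""

    def __init__(self, d_model: int = 128, n_heads: int = 4):
        super().__init__()
        # Cross-attention in both directions (fine -> coarse, coarse -> fine).
        self.fine_to_coarse = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.coarse_to_fine = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # 1-D convolutions for local contextual cues at each granularity.
        self.conv_fine = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1)
        self.conv_coarse = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1)
        self.norm_fine = nn.LayerNorm(d_model)
        self.norm_coarse = nn.LayerNorm(d_model)

    def forward(self, fine: torch.Tensor, coarse: torch.Tensor):
        # fine:   (batch, n_phonemes, d_model), e.g. phoneme-level features
        # coarse: (batch, n_words, d_model),    e.g. word-level features
        # Bidirectional interaction: each level queries the other.
        fine_upd, _ = self.coarse_to_fine(query=fine, key=coarse, value=coarse)
        coarse_upd, _ = self.fine_to_coarse(query=coarse, key=fine, value=fine)
        # Residual connections keep level-specific features from being forgotten.
        fine = self.norm_fine(fine + fine_upd)
        coarse = self.norm_coarse(coarse + coarse_upd)
        # Local context via 1-D convolution (Conv1d expects channels first).
        fine = fine + self.conv_fine(fine.transpose(1, 2)).transpose(1, 2)
        coarse = coarse + self.conv_coarse(coarse.transpose(1, 2)).transpose(1, 2)
        return fine, coarse


if __name__ == "__main__":
    block = InteractiveAttentionBlock()
    phon = torch.randn(2, 40, 128)   # phoneme-level sequence
    word = torch.randn(2, 10, 128)   # word-level sequence
    p_out, w_out = block(phon, word)
    print(p_out.shape, w_out.shape)  # (2, 40, 128) and (2, 10, 128)
```

In the same spirit, a third (utterance-level) stream could be added so that adjacent levels interact pairwise, but the exact stacking and scoring heads follow the paper rather than this sketch.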