Forced alignment systems automatically determine boundaries between segments in speech data, given an orthographic transcription. These tools are commonplace in phonetics, where they facilitate the use of speech data that would be infeasible to transcribe and segment manually. In the present paper, we describe a new neural network-based forced alignment system, the Mason-Alberta Phonetic Segmenter (MAPS). The MAPS aligner serves as a testbed for two possible improvements we pursue for forced alignment systems. The first is treating acoustic modeling in a forced aligner as a tagging task, rather than a classification task, motivated by the common understanding that segments in speech are not truly discrete and commonly overlap. The second is an interpolation technique that allows boundaries to be placed more precisely than the 10 ms limit common in modern forced alignment systems. We compare configurations of our system to a state-of-the-art system, the Montreal Forced Aligner. The tagging approach did not generally yield improved results over the Montreal Forced Aligner. However, a system with the interpolation technique had a 27.92% increase relative to the Montreal Forced Aligner in the number of boundaries within 10 ms of the target on the test set. We also reflect on the task and training process for acoustic modeling in forced alignment. In particular, we highlight that the output targets for these models do not match phoneticians' conception of similarity between phones, and that reconciling this tension may require rethinking the task and output targets or how speech itself should be segmented.
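As a rough illustration of the evaluation measure mentioned above, the following Python sketch computes the proportion of predicted boundaries that fall within a tolerance of their reference boundaries, and the relative increase of one system over a baseline. It is not the evaluation code used for MAPS. The function names, the assumption that predicted and reference boundaries are already paired one-to-one, and the example values are all hypothetical, and the sketch reads the reported figure as a relative (not absolute) increase.

```python
from typing import Sequence


def within_tolerance_rate(
    predicted: Sequence[float],
    reference: Sequence[float],
    tolerance_s: float = 0.010,
) -> float:
    """Proportion of predicted boundaries within `tolerance_s` seconds
    of the reference boundary they are paired with.

    Assumption: predicted[i] and reference[i] mark the same boundary,
    i.e. both alignments contain the same segment sequence.
    """
    if len(predicted) != len(reference):
        raise ValueError("boundary lists must be paired one-to-one")
    if len(reference) == 0:
        raise ValueError("at least one boundary is required")
    hits = sum(
        1 for p, r in zip(predicted, reference) if abs(p - r) <= tolerance_s
    )
    return hits / len(reference)


def relative_increase_percent(new: float, baseline: float) -> float:
    """Relative increase of `new` over `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (new - baseline) / baseline * 100.0


if __name__ == "__main__":
    # Made-up boundary times in seconds, for illustration only.
    reference = [0.100, 0.250, 0.400, 0.620]
    system_a = [0.104, 0.262, 0.395, 0.650]
    system_b = [0.103, 0.255, 0.396, 0.650]

    rate_a = within_tolerance_rate(system_a, reference)
    rate_b = within_tolerance_rate(system_b, reference)
    print(f"system A: {rate_a:.2%} within 10 ms")
    print(f"system B: {rate_b:.2%} within 10 ms")
    print(f"relative increase: {relative_increase_percent(rate_b, rate_a):.2f}%")
```

When both systems are scored on the same set of boundaries, a relative increase in the count of boundaries within tolerance equals the relative increase in the proportion, so either quantity can be passed to `relative_increase_percent`.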