Forced alignment is a common technique for aligning audio with orthographic and phonetic transcriptions. Most forced alignment tools provide only a single point estimate for each boundary. The present project introduces a method of deriving confidence intervals for these boundaries using a neural network ensemble technique. Ten different segment classifier neural networks were previously trained, and the alignment process is repeated with each model. Each boundary is then placed at the median of the corresponding boundaries in the resulting ensemble of alignments, and 97.85% confidence intervals are constructed using order statistics. The confidence intervals provide an estimate of the uncertainty in boundary placement, facilitating tasks such as identifying boundaries that should be reviewed. As a bonus, on the Buckeye and TIMIT corpora, the ensemble boundaries show a slight overall improvement over those from a single model. The confidence intervals can be emitted during the alignment process as JSON files and a main table for programmatic and statistical analysis. For familiarity, they are also output as Praat TextGrids using a point tier to represent the intervals.
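The 97.85% figure can be reproduced with a short calculation. The sketch below assumes that the interval endpoints are the 2nd-smallest and 2nd-largest of the ten ensemble boundaries; the text above does not state which order statistics are used, but that choice yields exactly this coverage when the ten estimates are treated as independent draws from a continuous distribution. The function names and the example boundary times are hypothetical and serve only as illustration.

```python
from math import comb
from statistics import median


def order_stat_coverage(n: int, k: int) -> float:
    """Coverage of [X_(k), X_(n-k+1)] as a confidence interval for the median.

    Assumes n independent draws from a continuous distribution. The interval
    misses the median when fewer than k samples fall on one side of it, which
    is a Binomial(n, 1/2) tail event on each side.
    """
    one_tail = sum(comb(n, i) for i in range(k)) / 2**n
    return 1 - 2 * one_tail


def ensemble_boundary(times, k=2):
    """Return (median, lower, upper) for one boundary across the ensemble.

    `times` holds one boundary time per model (hypothetical input format).
    lower and upper are the k-th smallest and k-th largest values.
    """
    s = sorted(times)
    n = len(s)
    return median(s), s[k - 1], s[n - k]


# Coverage for ten models with the 2nd and 9th order statistics.
coverage = order_stat_coverage(10, 2)
print(f"{coverage:.4%}")

# Hypothetical boundary times (seconds) from ten models for one boundary.
times = [0.512, 0.498, 0.505, 0.520, 0.501, 0.509, 0.495, 0.530, 0.503, 0.507]
mid, lower, upper = ensemble_boundary(times)
print(mid, lower, upper)
```

With ten models, the minimum and maximum would instead give a coverage of about 99.8%, so the narrower interval trades a little coverage for robustness to a single outlying model. A wide interval for a given boundary is a natural signal that it may need manual review.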