Accurately detecting voiced intervals in speech signals is a critical step in pitch tracking and has numerous applications. Conventional signal processing methods and deep learning algorithms have been proposed for this task, but their utility in real-world applications is restricted by limited generalization and by the need to fine-tune threshold parameters for each dataset. To address these challenges, this study proposes a supervised voicing detection model that leverages recorded laryngograph data. The model is based on a densely connected convolutional recurrent neural network (DC-CRN) and is trained on data with reference voicing decisions extracted from laryngograph datasets. Pretraining is also investigated as a way to improve the model's generalization ability. The proposed model produces robust voicing detection results, outperforms strong baseline methods, and generalizes well to unseen datasets. The source code of the proposed model with pretraining is provided, along with the list of laryngograph datasets used, to facilitate further research in this area.