This study presents a domain adaptation approach for speaker diarization of conversational Indonesian audio. We address the challenge of adapting an English-centric diarization pipeline to a low-resource language by generating synthetic training data with neural Text-to-Speech (TTS). Experiments cover two training configurations: a small dataset (171 samples) and a large dataset containing 25 hours of synthetic speech. The baseline \texttt{pyannote/segmentation-3.0} model, trained on the AMI Corpus, achieves a Diarization Error Rate (DER) of 53.47\% when applied zero-shot to Indonesian. Domain adaptation substantially improves performance: models trained on the small dataset reduce DER to 34.31\% (1 epoch) and 34.81\% (2 epochs). The model trained on the 25-hour dataset performs best, with a DER of 29.24\%, reported as a 13.68\% absolute improvement over the baseline, while maintaining 99.06\% Recall and 87.14\% F1-Score.
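As a reading aid, the following minimal Python sketch shows how DER is conventionally computed (false alarm, missed speech, and speaker confusion durations divided by total reference speech) and how absolute and relative reductions follow from the DER values quoted above. The function names and the `reported_der` dictionary are illustrative assumptions, not the authors' evaluation code. The baseline and 25-hour DER values differ by 24.23 percentage points, so the 13.68 figure quoted above presumably rests on a comparison that is not spelled out here.

```python
def diarization_error_rate(false_alarm, missed, confusion, total_speech):
    """Standard DER definition; all arguments are durations in seconds."""
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (false_alarm + missed + confusion) / total_speech


def absolute_reduction(baseline, adapted):
    """Difference between two DER values, in percentage points."""
    return baseline - adapted


def relative_reduction(baseline, adapted):
    """Reduction expressed as a percentage of the baseline DER."""
    return 100.0 * (baseline - adapted) / baseline


# DER values (in percent) copied from the abstract; the keys are hypothetical labels.
reported_der = {
    "baseline_zero_shot": 53.47,
    "small_1_epoch": 34.31,
    "small_2_epochs": 34.81,
    "large_25h": 29.24,
}

if __name__ == "__main__":
    baseline = reported_der["baseline_zero_shot"]
    for name, value in reported_der.items():
        if name == "baseline_zero_shot":
            continue
        abs_red = absolute_reduction(baseline, value)
        rel_red = relative_reduction(baseline, value)
        print(f"{name}: -{abs_red:.2f} points ({rel_red:.1f}% relative)")
```

Reporting both quantities avoids ambiguity: an absolute reduction is a difference in percentage points, while a relative reduction is scaled by the baseline DER.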