This paper presents a semi-automatic approach to creating a diachronic corpus of voices balanced for speaker age, gender, and recording period, organized into 32 categories (2 genders, 4 age ranges, and 4 recording periods). Source material was selected from the archives of the French National Audiovisual Institute (INA) with the goal of obtaining at least 30 speakers per category, for a target of 960 speakers; 874 have been found so far. For each speaker, speech excerpts were extracted from audiovisual documents using an automatic pipeline consisting of speech detection, removal of background music and overlapped speech, and speaker diarization. The pipeline presents clean speaker segments to human annotators, who then identify the target speakers. It proved highly effective, reducing manual processing by a factor of ten. We evaluate both the quality of the automatic processing and the quality of the final output. The results show that the automatic processing is comparable to state-of-the-art systems and that the output provides high-quality speech for most of the selected excerpts. This method shows promise for creating large corpora of known target speakers.
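The balanced design described above (2 genders, 4 age ranges, 4 recording periods, 30 speakers per category) can be illustrated with a minimal sketch. The age-range and period labels below are placeholders, since the abstract does not give the actual boundaries, and the function names are hypothetical rather than part of the authors' tooling.

```python
from collections import Counter
from itertools import product

# Placeholder labels: the abstract gives only the number of levels,
# not the actual age boundaries or recording periods.
GENDERS = ["female", "male"]
AGE_RANGES = ["age_1", "age_2", "age_3", "age_4"]
PERIODS = ["period_1", "period_2", "period_3", "period_4"]
TARGET_PER_CATEGORY = 30  # minimum number of speakers sought per category


def all_categories():
    """Return every (gender, age range, period) combination."""
    return list(product(GENDERS, AGE_RANGES, PERIODS))


def missing_speakers(speaker_categories):
    """Map each under-filled category to the number of speakers still needed.

    `speaker_categories` holds one (gender, age range, period) tuple
    per speaker already identified.
    """
    counts = Counter(speaker_categories)
    return {
        category: TARGET_PER_CATEGORY - counts[category]
        for category in all_categories()
        if counts[category] < TARGET_PER_CATEGORY
    }


categories = all_categories()
total_target = len(categories) * TARGET_PER_CATEGORY
print(len(categories), total_target)
```

Calling `missing_speakers` on the list of speakers identified so far would show which categories still fall short of the target. With 874 of the 960 target speakers found, 86 slots remain open in total; how they are distributed across categories is not stated in the abstract.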