Recent research using pre-trained transformer models suggests that just 10 minutes of transcribed speech may be enough to fine-tune such a model for automatic speech recognition (ASR) -- at least if we can also leverage vast amounts of text data (803 million tokens). But is that much text data necessary? We study the use of different amounts of text data, both for creating a lexicon that constrains ASR decoding to possible words (e.g. *dogz vs. dogs), and for training larger language models that bias the system toward probable word sequences (e.g. too dogs vs. two dogs). We perform experiments using 10 minutes of transcribed speech from English (for replicating prior work) and two additional pairs of languages differing in the availability of supplemental text data: Gronings and Frisian (~7.5M token corpora available), and Besemah and Nasal (only small lexica available). For all languages, we found that using only a lexicon did not appreciably improve ASR performance. For Gronings and Frisian, we found that lexica and language models derived from 'novel-length' 80k token subcorpora reduced the word error rate (WER) to 39% on average. Our findings suggest that where a text corpus in the upper tens of thousands of tokens or more is available, fine-tuning a transformer model with just tens of minutes of transcribed speech holds some promise towards obtaining human-correctable transcriptions near the 30% WER rule-of-thumb.
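Because the results above are reported as word error rate (WER), a minimal sketch of how that metric is commonly computed may help: the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the system output, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are illustrative assumptions, not the evaluation code used in this work.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace and compared exactly
    (no case folding or punctuation normalization).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution in a two-word reference.
    print(word_error_rate("two dogs", "too dogs"))
```

With the abstract's own example, the hypothesis "too dogs" against the reference "two dogs" contains one substitution in two reference words, giving a WER of 50%. A lexicon alone cannot repair this kind of error, since "too" is itself a valid word; choosing between the two requires a language model that scores word sequences.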