The Open Whisper-style Speech Model (OWSM) series was introduced to achieve full transparency in building advanced speech-to-text (S2T) foundation models. To this end, OWSM models are trained on 25 public speech datasets, which are heterogeneous in multiple ways. In this study, we advance the OWSM series by introducing OWSM v3.2, which improves on prior models by investigating and addressing the impacts of this data heterogeneity. Our study begins with a detailed analysis of each dataset, from which we derive two key strategies: data filtering with a proxy task to enhance data quality, and the incorporation of punctuation and true-casing using an open large language model (LLM). With all other configurations held constant, OWSM v3.2 improves performance over the OWSM v3.1 baseline while using 15% less training data.