We propose a multi-task learning (MTL) model that jointly performs three tasks commonly solved in a text-to-speech (TTS) front-end: text normalization (TN), part-of-speech (POS) tagging, and homograph disambiguation (HD). Our framework uses a tree-like structure in which a trunk learns shared representations and is followed by separate task-specific heads. We further incorporate a pre-trained language model to exploit its built-in lexical and contextual knowledge, and study how best to use its embeddings to benefit our multi-task model. Through task-wise ablations, we show that our full model trained on all three tasks achieves the strongest overall performance compared to models trained on individual tasks or on subsets of the tasks, confirming the advantages of our MTL framework. Finally, we introduce a new HD dataset containing a balanced number of sentences in diverse contexts for a variety of homographs and their pronunciations. We demonstrate that incorporating this dataset into training significantly improves HD performance compared with training only on a commonly used but imbalanced pre-existing dataset.
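The trunk-and-heads layout described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the class name `MultiTaskTagger`, the single ReLU layer used as the trunk, the layer sizes, and the label counts are all hypothetical, and the per-token input vectors merely stand in for embeddings that would come from a pre-trained language model.

```python
import random


def make_linear(n_in, n_out, rng):
    """Return a random weight matrix with n_out rows of n_in weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]


def apply_linear(weights, vec):
    """Multiply a weight matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


class MultiTaskTagger:
    """Hypothetical shared trunk followed by one linear head per task."""

    def __init__(self, emb_dim, hidden_dim, n_labels, seed=0):
        rng = random.Random(seed)
        # Trunk: parameters shared by every task.
        self.trunk = make_linear(emb_dim, hidden_dim, rng)
        # Heads: one independent output layer per task (e.g. TN, POS, HD).
        self.heads = {
            task: make_linear(hidden_dim, n, rng) for task, n in n_labels.items()
        }

    def forward(self, token_embeddings):
        """Return per-token label scores for each task."""
        scores = {task: [] for task in self.heads}
        for emb in token_embeddings:
            # Shared representation (ReLU over the trunk output).
            shared = [max(0.0, h) for h in apply_linear(self.trunk, emb)]
            # Every head reads the same shared representation.
            for task, head in self.heads.items():
                scores[task].append(apply_linear(head, shared))
        return scores


# Assumed label counts, chosen only for illustration.
model = MultiTaskTagger(emb_dim=4, hidden_dim=3,
                        n_labels={"tn": 5, "pos": 17, "hd": 2})
# Two tokens with made-up 4-dimensional embeddings.
scores = model.forward([[0.1, 0.2, 0.3, 0.4], [0.0, -0.1, 0.5, 0.2]])
for task, per_token in scores.items():
    print(task, len(per_token), "tokens,", len(per_token[0]), "labels")
```

In a joint training setup of this kind, the per-task losses would typically be combined into one objective so that gradients from all three tasks update the shared trunk; how the losses are weighted is not stated here.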