A Calculus-Based Framework for Determining Vocabulary Size in End-to-End ASR

From arXiv, 8 pages. This paper is an extension of S. K. Kopparapu and A. Panda, "A cost minimization approach to fix the vocabulary size in a tokenizer for an end-to-end ASR system," in Proceedings of the 2024 International Conference on Pattern Recognition, Kolkata, India, 2024.

In hybrid automatic speech recognition (ASR) systems, the vocabulary size is unambiguous: it is typically determined by the number of phones, bi-phones, or tri-phones present in the language. In contrast, end-to-end ASR systems derive their vocabulary, often referred to as tokens, from the text corpus used for training. The choice and, more importantly, the size of this vocabulary is a critical hyper-parameter in training end-to-end ASR systems. Tokenization algorithms such as Byte Pair Encoding (BPE), WordPiece, and Unigram Language Model (ULM) take the vocabulary size as an input hyper-parameter and generate the sub-words used during ASR training. Popular toolkits like ESPNet provide a fixed vocabulary size in their training recipes, but there is little documentation or discussion in the literature of how these values are determined. Recent work [1] formalized an approach to identify the vocabulary size best suited for end-to-end ASR, introducing a cost function framework that treats the tokenization process as a black box. In this paper, we build upon that foundation by fitting a curve to the training data and applying the first and second derivative tests from calculus to formally estimate the vocabulary size hyper-parameter. We demonstrate the utility of our approach by applying it to the standard Librispeech corpus and show that the optimal choice of the vocabulary size hyper-parameter improves ASR performance. The main contribution of this paper is in formalizing an approach to identify the vocabulary size best suited for training an end-to-end ASR system.
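The curve-fitting and derivative-test idea described above can be illustrated with a minimal sketch. The abstract does not give the paper's cost function or the form of the fitted curve, so everything below is an assumption made for illustration: the polynomial model, the function name `estimate_vocab_size`, and the synthetic (vocabulary size, cost) pairs are hypothetical and are not results from the paper.

```python
import numpy as np

def estimate_vocab_size(sizes, costs, degree=3):
    """Fit a polynomial to (vocabulary size, cost) points and return the
    vocabulary size at an interior minimum of the fitted curve, or None.

    Assumption: a polynomial fit stands in for whatever curve the paper fits.
    """
    x = np.asarray(sizes, dtype=float)
    y = np.asarray(costs, dtype=float)

    # Rescale sizes to [0, 1] so the least-squares fit is well conditioned.
    lo, hi = x.min(), x.max()
    t = (x - lo) / (hi - lo)

    poly = np.poly1d(np.polyfit(t, y, degree))
    d1 = poly.deriv(1)  # first derivative of the fitted curve
    d2 = poly.deriv(2)  # second derivative of the fitted curve

    candidates = []
    for r in d1.roots:
        is_real = abs(r.imag) < 1e-9
        # First derivative test: r is a critical point (d1(r) = 0).
        # Second derivative test: d2(r) > 0 means a local minimum.
        if is_real and 0.0 <= r.real <= 1.0 and d2(r.real) > 0:
            candidates.append(r.real)

    if not candidates:
        return None  # no interior minimum within the sampled range

    best = min(candidates, key=poly)  # lowest fitted cost among the minima
    return lo + best * (hi - lo)      # map back to the original scale

# Synthetic, purely illustrative data: a cost that is quadratic in the
# vocabulary size with its minimum placed at 800 by construction.
demo_sizes = np.array([100, 300, 500, 1000, 1500, 2000])
demo_costs = (demo_sizes - 800.0) ** 2 / 1e5 + 3.0
print(estimate_vocab_size(demo_sizes, demo_costs, degree=2))
```

In practice the cost values would come from running the tokenizer at several candidate vocabulary sizes and evaluating a cost on its output. The sketch only covers the final step of locating a minimum on the fitted curve.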
