Multilingual automatic speech recognition (ASR) requires tokenization that efficiently covers many writing systems. Byte-level BPE (BBPE) using UTF-8 is widely adopted for its language-agnostic design and full Unicode coverage, but its variable-length encoding inflates token sequences for non-Latin scripts, such as Chinese, Japanese, and Korean (CJK). Longer sequences increase computational load and memory use. We propose BBPE16, a UTF-16-based BBPE tokenizer that represents most modern scripts with a uniform 2-byte code unit. BBPE16 preserves BBPE's language-agnostic properties while substantially improving cross-lingual token sharing. Across monolingual, bilingual, and trilingual ASR, and in a multilingual continual-learning setup, BBPE16 attains comparable or better accuracy; for Chinese, it reduces token counts by up to 10.4% and lowers decoding iterations by up to 10.3%. These reductions speed up fine-tuning and inference and decrease memory usage, making BBPE16 a practical tokenization choice for multilingual ASR.
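As a quick, illustrative sketch of the encoding-width difference the abstract describes (this snippet is not from the paper; the strings are arbitrary examples): common CJK characters take 3 bytes each in UTF-8 but only 2 bytes each as UTF-16 code units, while ASCII text doubles in size under UTF-16.

```python
# Compare byte counts of Latin vs CJK text under UTF-8 and UTF-16.
latin = "speech recognition"   # 18 ASCII characters
cjk = "语音识别"                # 4 CJK characters ("speech recognition")

# UTF-8: 1 byte per ASCII character, 3 bytes per common CJK character.
print(len(latin.encode("utf-8")))      # 18
print(len(cjk.encode("utf-8")))        # 12

# UTF-16 (little-endian, no BOM): a uniform 2 bytes per BMP code point,
# so CJK shrinks from 3 bytes/char to 2, at the cost of doubling ASCII.
print(len(latin.encode("utf-16-le")))  # 36
print(len(cjk.encode("utf-16-le")))    # 8
```

This is the trade-off BBPE16 exploits: byte-level vocabularies built over 2-byte UTF-16 code units yield shorter byte sequences for CJK scripts than UTF-8-based BBPE, which in turn shortens token sequences after BPE merging.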