We present VietNormalizer, an open-source, zero-dependency Python library for Vietnamese text normalization targeting Text-to-Speech (TTS) and Natural Language Processing (NLP) applications. Vietnamese text normalization is a critical yet underserved preprocessing step: real-world Vietnamese text is densely populated with non-standard words (NSWs), including numbers, dates, times, currency amounts, percentages, acronyms, and foreign-language terms, all of which must be converted to fully pronounceable Vietnamese words before TTS synthesis or downstream language processing. Existing Vietnamese normalization tools either require heavy neural dependencies while covering only a narrow subset of NSW classes, or are embedded within larger NLP toolkits without standalone installability. VietNormalizer addresses these gaps through a unified, rule-based pipeline that: (1) converts arbitrary integers, decimals, and large numbers to Vietnamese words; (2) normalizes dates and times to their spoken Vietnamese forms; (3) handles VND and USD currency amounts; (4) expands percentages; (5) resolves acronyms via a customizable CSV dictionary; (6) transliterates non-Vietnamese loanwords and foreign terms to Vietnamese phonetic approximations; and (7) performs Unicode normalization and emoji/special-character removal. All regular expression patterns are pre-compiled at initialization, enabling high-throughput batch processing with minimal memory overhead and no GPU or external API dependency. The library is installable via pip install vietnormalizer, available on PyPI and GitHub at https://github.com/nghimestudio/vietnormalizer, and released under the MIT license. We discuss the design decisions, limitations of existing approaches, and the generalizability of the rule-based normalization paradigm to other low-resource tonal and agglutinative languages.
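To make the rule-based approach concrete, the following is a minimal sketch of one step of such a pipeline: percentage expansion with a regular expression compiled once at import time. It is not the library's actual API. The names PERCENT_RE, read_below_1000, and normalize_percent are hypothetical, and the sketch assumes whole-number percentages below 1000 only; decimals and larger values are deliberately left untouched.

```python
import re

# Hypothetical illustration only; these names are not taken from vietnormalizer.
DIGITS = ["không", "một", "hai", "ba", "bốn", "năm", "sáu", "bảy", "tám", "chín"]

# Compiled once, then reused for every input string.
# The lookbehind keeps the pattern from matching the tail of "2.5%" or "1000%".
PERCENT_RE = re.compile(r"(?<![\d.,])(\d{1,3})%")


def read_below_1000(n: int) -> str:
    """Spell out an integer in the range 0..999 in Vietnamese."""
    if not 0 <= n <= 999:
        raise ValueError("this sketch only covers 0..999")
    hundreds, rest = divmod(n, 100)
    tens, units = divmod(rest, 10)
    words = []
    if hundreds:
        words += [DIGITS[hundreds], "trăm"]
    if tens == 0:
        if units and hundreds:
            # A zero in the tens place is read as "linh" (e.g. 105).
            words += ["linh", DIGITS[units]]
        elif units or not hundreds:
            # Single digit, including zero itself.
            words.append(DIGITS[units])
    elif tens == 1:
        words.append("mười")
        if units == 5:
            words.append("lăm")  # 15 is "mười lăm"
        elif units:
            words.append(DIGITS[units])
    else:
        words += [DIGITS[tens], "mươi"]
        if units == 1:
            words.append("mốt")  # 21 is "hai mươi mốt"
        elif units == 5:
            words.append("lăm")  # 25 is "hai mươi lăm"
        elif units:
            words.append(DIGITS[units])
    return " ".join(words)


def normalize_percent(text: str) -> str:
    """Replace whole-number percentages with their spoken form."""
    return PERCENT_RE.sub(
        lambda m: read_below_1000(int(m.group(1))) + " phần trăm", text
    )


print(normalize_percent("Tăng 25% so với năm ngoái"))
```

A full pipeline in this style would presumably chain several such substitution passes (dates, currency, acronyms, and so on), each backed by its own pre-compiled pattern; the ordering and exact rules used by the library are not specified here.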