Digits micro-model for accurate and secure transactions

Automatic Speech Recognition (ASR) systems are used in the financial domain to enhance the caller experience by enabling natural language understanding and facilitating efficient, intuitive interactions. Increasing use of ASR systems requires that such systems exhibit very low error rates. The predominant ASR models used to collect numeric data are large, general-purpose models, either commercial, such as Google Speech-to-text (STT) and Amazon Transcribe, or open source, such as OpenAI's Whisper. Such ASR models are trained on hundreds of thousands of hours of audio data and require considerable resources to run. Despite recent progress in large speech recognition models, we highlight the potential of smaller, specialized "micro" models. Such lightweight models can be trained to perform well on number-recognition tasks, competing with general models like Whisper or Google STT while needing less than 80 minutes of training time and occupying at least an order of magnitude less memory. Also, unlike larger speech recognition models, micro-models are trained on carefully selected and curated datasets, which makes them highly accurate, agile, and easy to retrain while using few compute resources. We present our work on creating micro-models for multi-digit number recognition that handle diverse speaking styles reflecting real-world pronunciation patterns. Our work contributes to domain-specific ASR by improving both digit recognition accuracy and data privacy. As an added advantage, the low resource consumption of micro-models allows them to be hosted on-premise, keeping private data local instead of uploading it to an external cloud. Our results indicate that our micro-model makes fewer errors than the best-of-breed commercial or open-source ASRs in recognizing digits (1.8% error rate for our best micro-model versus 5.8% for Whisper) and has a low memory footprint (0.66 GB VRAM for our model versus 11 GB VRAM for Whisper).
