Text normalization and semantic parsing have numerous applications in natural language processing, such as natural language programming, paraphrasing, data augmentation, constructing expert systems, and text matching. Despite the prominent achievements of deep learning in Large Language Models (LLMs), the interpretability of neural network architectures remains poor, which undermines their credibility and thus limits their deployment in risk-sensitive scenarios. In scenario-specific domains with scarce data, rapidly obtaining a large number of supervised labels is challenging, and manually labeling data would require enormous effort. Catastrophic forgetting in neural networks further lowers data utilization. In situations where swift responses are vital, the size of the model makes local deployment difficult and response times long, which hinders local applications in these fields. Inspired by the multiplication rule, a principle of combinatorial mathematics, and by human thinking patterns, a multilayer framework and its algorithm, the Digestion Algorithm in Hierarchical Symbolic Forests (DAHSF), is proposed to address the issues above, combining the text normalization and semantic parsing workflows. The Chinese scripting language "Fire Bunny Intelligent Development Platform V2.0" serves as an important test and application of the technology discussed in this paper. DAHSF can run locally in scenario-specific domains on small datasets, with model size and memory usage optimized by at least two orders of magnitude, thereby improving execution speed, and it shows promising potential for further optimization.
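The combinatorial intuition behind the multiplication rule mentioned above can be sketched as follows: if each layer of a hierarchy contributes its own small set of interchangeable patterns, the layers jointly cover the product of the set sizes. The pattern sets below are invented purely for illustration and are not taken from the DAHSF implementation.

```python
from itertools import product

# Hypothetical layers of interchangeable patterns (illustrative only):
# each layer is a small set of alternatives, and composing the layers
# covers the product of the set sizes (the multiplication rule).
layers = [
    ["please ", ""],              # optional politeness prefix (2 choices)
    ["open", "launch", "start"],  # verb synonyms (3 choices)
    [" the file", " file"],       # object phrasing (2 choices)
]

# All surface forms that would normalize to one canonical command.
surface_forms = ["".join(parts) for parts in product(*layers)]

# 2 * 3 * 2 = 12 distinct forms from only 2 + 3 + 2 = 7 stored patterns.
print(len(surface_forms))  # prints 12
```

The point of the sketch is the asymmetry it exposes: storage grows additively with the patterns per layer, while coverage grows multiplicatively, which is why a small, locally stored symbolic model can match a large space of inputs.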