Bahnar, a minority language spoken across Vietnam, Cambodia, and Laos, is difficult to preserve because little research and data exist for it. This study addresses the need for accurate digitization of Bahnar-language documents using optical character recognition (OCR). Scanned paper documents are hard to digitize: broken or blurred regions degrade image quality and introduce OCR errors that in turn compromise downstream information retrieval systems. We propose an approach that combines table and non-table detection techniques with probability-based post-processing heuristics to improve recognition accuracy. The method first applies the detection algorithms to improve the quality of the OCR input, then applies probabilistic error correction to the OCR output. In our experiments, recognition accuracy rises from 72.86% to 79.26%, a gain of 6.40 percentage points. This work contributes resources for Bahnar language preservation and provides a framework applicable to other minority-language digitization efforts.
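The abstract does not specify the post-processing heuristics, so the following is only a minimal sketch of one common form of probability-based OCR correction, not the authors' method. It assumes a word-frequency lexicon is available and replaces each out-of-lexicon token with the most frequent lexicon word within one character edit. The lexicon contents, the alphabet, and the function names `edits1`, `correct`, and `correct_line` are all hypothetical and chosen for illustration; a real Bahnar system would need a lexicon and alphabet that cover the language's diacritics.

```python
from collections import Counter

# Hypothetical toy lexicon with made-up frequency counts (not Bahnar data).
LEXICON = Counter({"table": 50, "cable": 5, "text": 40, "test": 30})

# Assumed candidate alphabet; a real system would include diacritic letters.
ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """Return every string one deletion, substitution, or insertion away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    replaces = [left + c + right[1:] for left, right in splits if right for c in ALPHABET]
    inserts = [left + c + right for left, right in splits for c in ALPHABET]
    return set(deletes + replaces + inserts)


def correct(token):
    """Pick the most frequent lexicon word within one edit of the token."""
    if token in LEXICON:
        return token  # already a known word, leave it unchanged
    candidates = [w for w in edits1(token) if w in LEXICON]
    if not candidates:
        return token  # no plausible correction, keep the OCR output
    # Highest frequency wins; the word itself breaks ties deterministically.
    return max(candidates, key=lambda w: (LEXICON[w], w))


def correct_line(line):
    """Correct each whitespace-separated token of an OCR output line."""
    return " ".join(correct(tok) for tok in line.split())


if __name__ == "__main__":
    print(correct_line("tabl3 tezt"))
```

Using raw frequency as the score is the simplest choice; a fuller noisy-channel model would also weight each candidate by how likely the OCR engine is to confuse the specific characters involved.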