Claims documents are fundamental to healthcare and insurance operations, serving as the basis for reimbursement, auditing, and compliance. However, these documents are typically not born digital; they often exist as scanned PDFs or photographs captured under uncontrolled conditions. Consequently, they exhibit significant content heterogeneity, ranging from typed invoices to handwritten medical reports, alongside considerable linguistic diversity. This challenge is exemplified by operations at Fullerton Health, which handles tens of millions of claims annually across nine markets: Singapore, the Philippines, Indonesia, Malaysia, Mainland China, Hong Kong, Vietnam, Papua New Guinea, and Cambodia. Such variability, coupled with inconsistent image quality and diverse layouts, poses a major obstacle to automated parsing and structured information extraction. This paper presents a robust multi-stage pipeline that integrates the multilingual optical character recognition (OCR) engine PaddleOCR, a traditional Logistic Regression classifier, and a compact Vision-Language Model (VLM), Qwen 2.5-VL-7B, to achieve efficient and accurate field extraction from large-scale claims data. The proposed system achieves a document-type classification accuracy of over 95 percent and a field-level extraction accuracy of approximately 87 percent, while maintaining an average processing latency of under 2 seconds per document. Compared to manual processing, which typically requires around 10 minutes per claim, our system delivers a roughly 300x improvement in efficiency. These results demonstrate that combining traditional machine learning models with a modern VLM enables production-grade accuracy and speed for real-world document automation. The solution has been deployed in our mobile application and currently processes tens of thousands of claims weekly from Vietnam and Singapore.
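To make the three-stage flow described above concrete, the sketch below outlines one plausible realization in Python: PaddleOCR produces text, a TF-IDF plus Logistic Regression model predicts the document type, and a VLM is prompted for structured fields. It is a minimal illustration, not the deployed implementation: the field schema, configuration values, and the `call_vlm` function standing in for the Qwen 2.5-VL-7B serving endpoint are all assumptions, and the exact PaddleOCR result format varies by library version.

```python
# Illustrative sketch of a three-stage claims pipeline:
# OCR -> document-type classification -> VLM field extraction.
# Names other than PaddleOCR and scikit-learn classes are hypothetical placeholders.
import json

from paddleocr import PaddleOCR
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Stage 1: OCR. A production system would configure multilingual models;
# "en" here is an assumption for the sketch.
ocr = PaddleOCR(lang="en")

def extract_text(image_path: str) -> str:
    # PaddleOCR returns, per image, a list of [bbox, (text, confidence)] entries;
    # the exact nesting differs across versions.
    result = ocr.ocr(image_path)
    lines = [entry[1][0] for page in result for entry in page]
    return "\n".join(lines)

# Stage 2: lightweight document-type classifier over the OCR text.
# TF-IDF features with Logistic Regression are one simple way to realize
# the "traditional Logistic Regression classifier" named in the abstract.
doc_type_clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
# doc_type_clf.fit(train_texts, train_labels)  # trained offline on labeled claims

# Stage 3: field extraction with a compact VLM. `call_vlm(image_path, prompt)`
# is a placeholder for whatever endpoint hosts Qwen 2.5-VL-7B.
def extract_fields(image_path: str, doc_type: str, call_vlm) -> dict:
    prompt = (
        f"This is a {doc_type} claims document. "
        "Return a JSON object with keys: provider_name, visit_date, "
        "diagnosis, total_amount, currency."  # hypothetical field schema
    )
    return json.loads(call_vlm(image_path, prompt))

def process_claim(image_path: str, call_vlm) -> dict:
    text = extract_text(image_path)
    doc_type = doc_type_clf.predict([text])[0]
    return {
        "doc_type": doc_type,
        "fields": extract_fields(image_path, doc_type, call_vlm),
    }
```

Keeping the classifier outside the VLM call reflects the design choice implied by the abstract: the cheap Logistic Regression stage routes each document before the comparatively expensive VLM is invoked, which helps keep average latency low at scale.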