This paper presents methods for extracting structured information from invoice documents and proposes a set of evaluation metrics (EM) for assessing the accuracy of the extracted data against annotated ground truth. The approach pre-processes scanned or digital invoices and then applies Docling and LlamaCloud Services to identify and extract key fields such as invoice number, date, total amount, and vendor details. To assess the reliability of the extraction process, we establish an evaluation framework comprising field-level precision, the consistency-check failure rate, and exact-match accuracy. The proposed metrics provide a standardized way to compare extraction methods and to expose field-specific strengths and weaknesses.
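As an illustration of the three metrics named above, the following is a minimal sketch and not the paper's implementation. The field names (`invoice_number`, `date`, `total_amount`), the whitespace and case normalization, and the consistency rule (the total must be present and parse as a number) are all assumptions made for this example; the exact definitions used in the evaluation may differ.

```python
from typing import Dict, List, Optional

# One extracted or annotated invoice: field name -> string value (None if absent).
Invoice = Dict[str, Optional[str]]


def normalize(value: Optional[str]) -> Optional[str]:
    """Assumed normalization: collapse whitespace and ignore case."""
    if value is None:
        return None
    return " ".join(str(value).split()).lower()


def field_precision(preds: List[Invoice], golds: List[Invoice], field: str) -> float:
    """Among documents where a value was emitted for `field`, the share that match ground truth."""
    emitted = 0
    correct = 0
    for pred, gold in zip(preds, golds):
        predicted = normalize(pred.get(field))
        if predicted is None:
            continue  # nothing emitted, so it does not count toward precision
        emitted += 1
        if predicted == normalize(gold.get(field)):
            correct += 1
    return correct / emitted if emitted else 0.0


def exact_match_accuracy(preds: List[Invoice], golds: List[Invoice], fields: List[str]) -> float:
    """Share of documents for which every listed field matches ground truth."""
    if not golds:
        return 0.0
    hits = sum(
        all(normalize(p.get(f)) == normalize(g.get(f)) for f in fields)
        for p, g in zip(preds, golds)
    )
    return hits / len(golds)


def parses_as_amount(value: Optional[str]) -> bool:
    """Hypothetical consistency rule: the value is present and parses as a number."""
    if value is None:
        return False
    try:
        float(str(value).replace(",", ""))
        return True
    except ValueError:
        return False


def count_consistency_failures(preds: List[Invoice], amount_field: str = "total_amount") -> int:
    """Number of predictions that violate the (assumed) consistency rule."""
    return sum(0 if parses_as_amount(p.get(amount_field)) else 1 for p in preds)


# Toy data, invented purely for illustration.
FIELDS = ["invoice_number", "date", "total_amount"]
golds = [
    {"invoice_number": "INV-001", "date": "2024-01-05", "total_amount": "100.00"},
    {"invoice_number": "INV-002", "date": "2024-02-10", "total_amount": "250.50"},
]
preds = [
    {"invoice_number": "inv-001", "date": "2024-01-05", "total_amount": "100.00"},
    {"invoice_number": "INV-002", "date": None, "total_amount": "2S0.50"},  # OCR-style error
]

per_field = {f: field_precision(preds, golds, f) for f in FIELDS}
em_accuracy = exact_match_accuracy(preds, golds, FIELDS)
failures = count_consistency_failures(preds)
```

Reporting precision per field alongside document-level exact match keeps a single hard field (here, the total amount) from being hidden by an aggregate score, which is the kind of field-specific comparison the abstract describes.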