Key information extraction is critical for converting scanned receipts into structured, accessible documents from which vital data can be retrieved efficiently. This work introduces a large, novel multilingual dataset intended to advance receipt information extraction and item classification. The dataset contains 47,720 annotated samples, each labeled with the item name and associated attributes such as price and brand, and organized into 44 distinct product categories. We present InstructLLaMA, an approach that achieves an F1 score of 0.76 and an accuracy of 0.68 on the key information extraction and item classification tasks. To support further research and application development, we release the dataset, the InstructLLaMA model, and related resources at https://github.com/Update-For-Integrated-Business-AI/AMuRD.