ChatSchema: A pipeline of extracting structured information with Large Multimodal Models based on schema

Objective: This study introduces ChatSchema, a schema-based method that combines Large Multimodal Models (LMMs) with Optical Character Recognition (OCR) to extract and structure information from the unstructured data in paper-based medical reports. By integrating a predefined schema, we aim to enable LMMs to extract and standardize information directly according to the schema specifications, facilitating subsequent data entry.

Method: Our approach is a two-stage process: classification, which categorizes the report scenario, and extraction, which structures the information. We established and annotated a dataset to verify the effectiveness of ChatSchema, and evaluated key extraction using precision, recall, F1-score, and accuracy. Building on the key extraction results, we then assessed value extraction. We conducted ablation studies on two LMMs to show how different input modalities and methods improve structured information extraction.

Result: We analyzed 100 medical reports from Peking University First Hospital and established a ground truth dataset of 2,945 key-value pairs. We evaluated ChatSchema with GPT-4o and Gemini 1.5 Pro and found that GPT-4o achieved higher overall performance. For key extraction, key-precision was 98.6%, key-recall was 98.5%, and key-F1-score was 98.6%. For value extraction on correctly extracted keys, the overall accuracy was 97.2%, precision was 95.8%, recall was 95.8%, and F1-score was 95.8%. An ablation study demonstrated that ChatSchema achieved significantly higher overall accuracy and overall F1-score for key-value extraction than the Baseline, with increases of 26.9% in overall accuracy and 27.4% in overall F1-score.
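To make the evaluation concrete, the following is a minimal sketch of how key-level precision, recall, and F1-score, and value accuracy on correctly extracted keys, could be computed for a single report. It is an illustration under stated assumptions, not the procedure used in the study: the function names `key_metrics` and `value_accuracy`, the sample field names, and the exact-string matching rule are all hypothetical.

```python
from typing import Dict, Tuple


def key_metrics(pred: Dict[str, str], truth: Dict[str, str]) -> Tuple[float, float, float]:
    """Key-level precision, recall and F1 for one report.

    Assumption: a predicted key counts as correct only if the same key
    string appears in the ground truth annotation.
    """
    pred_keys, true_keys = set(pred), set(truth)
    true_positives = len(pred_keys & true_keys)
    precision = true_positives / len(pred_keys) if pred_keys else 0.0
    recall = true_positives / len(true_keys) if true_keys else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def value_accuracy(pred: Dict[str, str], truth: Dict[str, str]) -> float:
    """Share of correctly extracted keys whose value also matches.

    Assumption: values match when they are equal after stripping
    surrounding whitespace; a real evaluation may use a looser rule.
    """
    shared = set(pred) & set(truth)
    if not shared:
        return 0.0
    correct = sum(1 for key in shared if pred[key].strip() == truth[key].strip())
    return correct / len(shared)


# Hypothetical ground truth and model output for one report.
truth = {
    "patient_name": "A",
    "age": "54",
    "diagnosis": "hypertension",
    "report_date": "2023-01-01",
}
pred = {
    "patient_name": "A",
    "age": "45",              # key found, value wrong
    "diagnosis": "hypertension",
    "hospital": "X",          # key not in the ground truth
}

precision, recall, f1 = key_metrics(pred, truth)
accuracy = value_accuracy(pred, truth)
print(f"key P={precision:.2f} R={recall:.2f} F1={f1:.2f}, value accuracy={accuracy:.2f}")
```

In this toy case three of the four predicted keys are in the ground truth, and two of those three carry the right value. Scoring values only on matched keys mirrors the abstract's description of value extraction being assessed "based on correct key extraction".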
