Recent advances in Large Language Models (LLMs) and Large Multi-modal Models (LMMs) have shown potential in various medical applications, such as intelligent medical diagnosis. Although impressive results have been achieved, we find that existing benchmarks reflect neither the complexity of real medical reports nor the specialized, in-depth reasoning they require. In this work, we introduce RJUA-MedDQA, a comprehensive benchmark in the field of medical specialization, which poses several challenges: comprehensively interpreting image content across diverse, challenging layouts; possessing the numerical reasoning ability to identify abnormal indicators; and demonstrating the clinical reasoning ability to provide statements of disease diagnosis, status, and advice based on medical contexts. We carefully design the data generation pipeline and propose the Efficient Structural Restoration Annotation (ESRA) Method, aimed at restoring textual and tabular content in medical report images. This method substantially enhances annotation efficiency, doubling the productivity of each annotator, and yields a 26.8% improvement in accuracy. We conduct extensive evaluations, including few-shot assessments of five LMMs capable of solving Chinese medical QA tasks. To further investigate the limitations and potential of current LMMs, we conduct comparative experiments on a set of strong LLMs using the image-text generated by the ESRA method. We report the performance of baselines and offer several observations: (1) The overall performance of existing LMMs is still limited; however, LMMs are more robust to low-quality and diverse-structured images than LLMs. (3) Reasoning across context and image content presents significant challenges. We hope this benchmark helps the community make progress on these challenging tasks in multi-modal medical document understanding and facilitates its application in healthcare.