Online medical forums have long served as vital platforms where patients seek professional healthcare advice, generating vast amounts of valuable knowledge. However, the informal nature and linguistic complexity of forum interactions pose significant challenges for automated question answering systems, especially in non-English languages. We present two comprehensive Italian medical benchmarks: \textbf{IMB-QA}, containing 782,644 patient-doctor conversations across 77 medical categories, and \textbf{IMB-MCQA}, comprising 25,862 multiple-choice questions from medical specialty examinations. We demonstrate how Large Language Models (LLMs) can be leveraged to improve the clarity and consistency of medical forum data while preserving its original meaning and conversational style, and we compare a variety of LLM architectures on both open-ended and multiple-choice question answering tasks. Our experiments with Retrieval-Augmented Generation (RAG) and domain-specific fine-tuning reveal that specialized adaptation strategies can outperform larger, general-purpose models on medical question answering tasks. These findings suggest that effective medical AI systems may benefit more from domain expertise and efficient information retrieval than from increased model scale. We release both datasets and our evaluation frameworks in our GitHub repository to support further research on multilingual medical question answering: https://github.com/PRAISELab-PicusLab/IMB.