Most existing medical dialogue systems operate in a single-turn question-answering paradigm or rely on template-based datasets, which limits conversational realism and multilingual applicability. We introduce IndicMedDialog, a parallel multi-turn medical dialogue dataset spanning English and nine Indic languages: Assamese, Bengali, Gujarati, Hindi, Marathi, Punjabi, Tamil, Telugu, and Urdu. The dataset extends MDDial with LLM-generated synthetic consultations, which are translated using TranslateGemma, verified by native speakers, and refined through a script-aware post-processing pipeline that corrects phonetic, lexical, and character-spacing errors. Building on this dataset, we train IndicMedLM by parameter-efficient fine-tuning of a quantized small language model, with optional patient pre-context to personalise multi-turn symptom elicitation. We evaluate IndicMedLM against zero-shot multilingual baselines, conduct a systematic error analysis across all ten languages, and validate clinical plausibility through medical expert evaluation.
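The script-aware post-processing step mentioned above is not specified in detail here, so the following is only a minimal sketch of what one character-spacing rule might look like, not the authors' pipeline. It assumes that a common translation artifact is a stray space between a base character and a following combining mark (for example a dependent vowel sign or virama), and it removes such spaces using Unicode general categories. The function names `is_combining_mark` and `fix_mark_spacing` are hypothetical.

```python
import unicodedata


def is_combining_mark(ch: str) -> bool:
    """Return True for Unicode combining marks (categories Mn, Mc, Me)."""
    return unicodedata.category(ch).startswith("M")


def fix_mark_spacing(text: str) -> str:
    """Remove spaces that sit directly before a combining mark.

    Assumption (not from the paper): a space immediately followed by a
    dependent vowel sign, virama, or similar mark is an artifact, since
    such marks must attach to a preceding base character.
    """
    out = []
    for ch in text:
        if is_combining_mark(ch):
            # Drop the run of spaces that separates the mark from its base.
            while out and out[-1] == " ":
                out.pop()
        out.append(ch)
    # Normalise to NFC so equivalent sequences compare equal downstream.
    return unicodedata.normalize("NFC", "".join(out))


if __name__ == "__main__":
    # Devanagari KA, stray space, vowel sign I
    broken = "\u0915 \u093f"
    print(fix_mark_spacing(broken) == "\u0915\u093f")
```

Because the rule relies on Unicode categories rather than per-script tables, the same function applies to Bengali, Gujarati, Tamil, Telugu, and the other scripts in the dataset, while leaving Latin text unchanged. A real pipeline would likely need additional script-specific rules for the phonetic and lexical corrections.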