As Large Language Models (LLMs) are increasingly deployed in healthcare, it becomes essential to carefully evaluate their medical safety before clinical use. However, existing safety benchmarks remain predominantly English-centric and test only single-turn prompts, even though clinical consultations are typically multi-turn. To address these gaps, we introduce JMedEthicBench, the first multi-turn conversational benchmark for evaluating the medical safety of LLMs in Japanese healthcare. Our benchmark is based on 67 guidelines from the Japan Medical Association and contains over 50,000 adversarial conversations generated using seven automatically discovered jailbreak strategies. Using a dual-LLM scoring protocol, we evaluate 27 models and find that commercial models maintain robust safety while medical-specialized models exhibit increased vulnerability. Furthermore, safety scores decline significantly across conversation turns (median: 9.5 to 5.0, $p < 0.001$). Cross-lingual evaluation on both the Japanese and English versions of our benchmark reveals that medical model vulnerabilities persist across languages, indicating inherent alignment limitations rather than language-specific factors. These findings suggest that domain-specific fine-tuning may inadvertently weaken safety mechanisms and that multi-turn interactions represent a distinct threat surface requiring dedicated alignment strategies.