As Large Language Models (LLMs) are increasingly deployed in healthcare, it becomes essential to evaluate their medical safety carefully before clinical use. However, existing safety benchmarks remain predominantly English-centric and test models with only single-turn prompts, despite the multi-turn nature of clinical consultations. To address these gaps, we introduce JMedEthicBench, the first multi-turn conversational benchmark for evaluating the medical safety of LLMs in Japanese healthcare. Our benchmark is grounded in 67 guidelines from the Japan Medical Association and contains over 50,000 adversarial conversations generated using seven automatically discovered jailbreak strategies. Using a dual-LLM scoring protocol, we evaluate 27 models and find that commercial models maintain robust safety, while medical-specialized models exhibit increased vulnerability. Furthermore, safety scores decline significantly across conversation turns (median: 9.5 to 5.0, $p < 0.001$). Cross-lingual evaluation on Japanese and English versions of our benchmark reveals that the vulnerabilities of medical models persist across languages, indicating inherent alignment limitations rather than language-specific factors. These findings suggest that domain-specific fine-tuning may inadvertently weaken safety mechanisms and that multi-turn interactions represent a distinct threat surface requiring dedicated alignment strategies.