This study investigates the applicability of HealthBench, a large-scale, rubric-based medical benchmark, to the Japanese context. Although robust evaluation frameworks are essential for the safe development of medical LLMs, Japanese-language resources are scarce and often consist of translated multiple-choice questions. Our research addresses this issue in two ways. First, we establish a performance baseline by applying a machine-translated version of HealthBench's 5,000 scenarios to evaluate two models: a high-performing multilingual model (GPT-4.1) and a Japanese-native open-source model (LLM-jp-3.1). Second, we use an LLM-as-a-Judge approach to systematically classify the benchmark's scenarios and rubric criteria, allowing us to identify 'contextual gaps' where content is misaligned with Japan's clinical guidelines, healthcare system, or cultural norms. Our findings reveal a modest performance drop in GPT-4.1 attributable to rubric mismatches, and a substantial failure in the Japanese-native model, which lacked the required clinical completeness. Furthermore, our classification shows that, although most scenarios remain applicable, a significant proportion of the rubric criteria require localisation. This work underscores the limitations of direct benchmark translation and highlights the urgent need for a context-aware, localised adaptation, a "J-HealthBench", to ensure the reliable and safe evaluation of medical LLMs in Japan.