Blinded Radiologist and LLM-Based Evaluation of LLM-Generated Japanese Translations of Chest CT Reports: Comparative Study

Background: Accurate translation of radiology reports is important for multilingual research, clinical communication, and radiology education, but the validity of large language model (LLM)-based evaluation of such translations remains unclear.

Objective: To evaluate the educational suitability of LLM-generated Japanese translations of chest CT reports and to compare radiologist assessments with LLM-as-a-judge evaluations.

Methods: We analyzed 150 chest CT reports from the CT-RATE-JPN validation set. For each English report, a human-edited Japanese translation was compared with an LLM-generated translation produced by DeepSeek-V3.2. A board-certified radiologist and a radiology resident independently performed blinded pairwise evaluations across 4 criteria: terminology accuracy, readability, overall quality, and radiologist-style authenticity. In parallel, 3 LLM judges (DeepSeek-V3.2, Mistral Large 3, and GPT-5) evaluated the same pairs. Agreement was assessed using quadratic weighted kappa (QWK) and percentage agreement.

Results: Agreement between radiologists and LLM judges was near zero (QWK=-0.04 to 0.15). Agreement between the 2 radiologists was also poor (QWK=0.01 to 0.06). Radiologist 1 rated terminology as equivalent in 59% of cases and favored the LLM translation for readability (51%) and overall quality (51%). Radiologist 2 rated readability as equivalent in 75% of cases and favored the human-edited translation for overall quality (40% vs 21%). All 3 LLM judges strongly favored the LLM translation across all criteria (70%-99%) and rated it as more radiologist-like in >93% of cases.

Conclusions: LLM-generated translations were often judged natural and fluent, but the 2 radiologists differed substantially in their assessments. The LLM judges showed a strong preference for LLM output and negligible agreement with radiologists. For educational use of translated radiology reports, automated LLM-based evaluation alone is insufficient; expert radiologist review remains important.
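The two agreement measures named in the Methods can be illustrated with a minimal Python sketch. It is not the authors' analysis code. It assumes each pairwise judgment is coded on a 3-point ordinal scale (0 = human-edited translation better, 1 = equivalent, 2 = LLM translation better); that coding, the function names, and the toy ratings are hypothetical and not taken from the study.

```python
from collections import Counter


def percent_agreement(a, b):
    """Fraction of items on which two raters gave the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("rating lists must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def quadratic_weighted_kappa(a, b, n_categories):
    """Cohen's kappa with quadratic weights for labels coded 0..n_categories-1."""
    if len(a) != len(b) or not a:
        raise ValueError("rating lists must be non-empty and of equal length")
    n = len(a)
    observed = Counter(zip(a, b))      # joint counts of (rater A, rater B) labels
    hist_a, hist_b = Counter(a), Counter(b)
    max_dist = (n_categories - 1) ** 2
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n_categories):
        for j in range(n_categories):
            w = (i - j) ** 2 / max_dist   # disagreement weight in [0, 1]
            weighted_observed += w * observed[(i, j)] / n
            weighted_expected += w * hist_a[i] * hist_b[j] / (n * n)
    if weighted_expected == 0:
        # Both raters used one identical category throughout: kappa is undefined.
        return float("nan")
    return 1.0 - weighted_observed / weighted_expected


if __name__ == "__main__":
    # Hypothetical toy ratings, not data from the study.
    rater_1 = [1, 1, 2, 2, 0, 1, 2, 1]
    rater_2 = [1, 0, 1, 1, 0, 1, 1, 0]
    print("percent agreement:", percent_agreement(rater_1, rater_2))
    print("QWK:", quadratic_weighted_kappa(rater_1, rater_2, 3))
```

Because QWK corrects for the agreement expected from each rater's marginal label distribution, two raters can agree on a sizable share of items and still obtain a kappa near zero, which is why the two measures are reported together.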
