This study presents the first systematic, reference-free human evaluation of large language model (LLM) machine translation (MT) for Ancient Greek (AG) technical prose. We evaluate translations by three commercial LLMs (Claude, Gemini, ChatGPT) of twenty paragraph-length passages from two works by the Greek physician Galen of Pergamum (ca. 129–216 CE): On Mixtures, which has two published English translations, and On the Composition of Drugs according to Kinds, which has never been fully translated into English. We assess translation quality using both standard automated evaluation metrics (BLEU, chrF++, METEOR, ROUGE-L, BERTScore, COMET, BLEURT) and expert human evaluation via a modified Multidimensional Quality Metrics (MQM) framework, applied to all 60 translations by a team of domain specialists. On the previously translated expository text, LLMs achieved high translation quality (mean MQM score 95.2/100), with performance approaching expert level. On the untranslated pharmacological text, aggregate quality was lower (79.9/100), with high variance driven by two passages of extreme terminological density; excluding these, scores came within 4 points of those for the translated text. Terminology rarity, operationalized via corpus frequency in the literary Diorisis Ancient Greek Corpus, emerged as a strong predictor of translation failure (r = -.97 for passage-level quality on the untranslated text). On the text with a wide quality spread (Composition), automated metrics correlated moderately with human judgment overall, but no metric discriminated among high-quality translations. We discuss implications for the use of LLMs in Classical scholarship and for the design of automated evaluation pipelines for low-resource ancient languages.
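As an illustration of the kind of passage-level analysis behind the reported correlation between terminology rarity and quality, the following is a minimal sketch only. The rarity proxy (mean negative log of smoothed corpus frequency), the function names, and all numbers are hypothetical placeholders; they are not the study's operationalization or data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def rarity(term_frequencies):
    """Hypothetical rarity proxy: mean negative log of add-one-smoothed
    corpus frequency. Higher values mean rarer terminology."""
    return sum(-math.log(f + 1.0) for f in term_frequencies) / len(term_frequencies)


# Invented toy data: (corpus frequencies of a passage's technical terms,
# passage-level quality score on a 0-100 scale). Not taken from the study.
toy_passages = [
    ([120, 80, 45], 96.0),
    ([60, 30, 12], 90.0),
    ([10, 4, 1], 78.0),
    ([2, 0, 0], 55.0),
]

toy_rarity = [rarity(freqs) for freqs, _ in toy_passages]
toy_scores = [score for _, score in toy_passages]
toy_r = pearson_r(toy_rarity, toy_scores)
print(f"toy Pearson r = {toy_r:.2f}")
```

A negative coefficient on such data indicates that passages with rarer terminology receive lower quality scores, which is the direction of the effect the abstract reports.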