The advent of neural machine translation (NMT) has expanded the scope of translation beyond isolated sentences, enabling context to be preserved across paragraphs and documents. However, current evaluation metrics remain largely restricted to the sentence level and typically depend on reference translations. Without references, existing metrics cannot provide a clear basis for their quality assessments. To address these limitations, we propose an evaluation framework that independently extracts and compares the latent topic structures of source and translated texts, using topic modelling techniques including latent semantic analysis (LSA), latent Dirichlet allocation (LDA) and BERTopic. Our methodology captures both statistical frequency information and semantic context, providing an evaluation of the document as a whole. It aligns key topic tokens across languages using a bilingual dictionary and quantifies thematic consistency via cosine similarity. This allows us to evaluate how faithfully a translation maintains the thematic integrity of the source text, even in the absence of reference translations. To this end, we used a large-scale dataset of 9.38 million Korean-to-English sentence pairs from AI Hub, which includes pre-evaluated BLEU scores. We also calculated CometKiwi, a state-of-the-art reference-free metric, for this dataset in order to conduct a comparative analysis with our proposed topic-based framework. Through this analysis, we confirmed that our framework captures a distinct attribute that existing metrics do not: thematic consistency at the document level. Furthermore, visualising the key tokens that underpin the quantitative score provides clear insight into translation quality. This study therefore complements existing translation evaluation methods by proposing a new metric that intuitively identifies whether a document's theme has been preserved.
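The alignment and scoring step described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that a topic model has already produced weighted key tokens for the source and the translation. The function names, the toy bilingual dictionary and the weights are all hypothetical, and dropping source tokens that have no dictionary entry is an assumption of this sketch.

```python
import math


def align_tokens(src_weights, bilingual_dict):
    """Map weighted source-language topic tokens into the target language.

    Assumption of this sketch: tokens missing from the dictionary are dropped,
    and a token with several translations contributes its weight to each.
    """
    aligned = {}
    for token, weight in src_weights.items():
        for translation in bilingual_dict.get(token, []):
            aligned[translation] = aligned.get(translation, 0.0) + weight
    return aligned


def cosine_similarity(a, b):
    """Cosine similarity between two sparse token-weight vectors (dicts)."""
    dot = sum(weight * b.get(token, 0.0) for token, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # no usable tokens on one side
    return dot / (norm_a * norm_b)


def thematic_consistency(src_weights, tgt_weights, bilingual_dict):
    """Score how well the translation's topic tokens match the source's."""
    aligned = align_tokens(src_weights, bilingual_dict)
    return cosine_similarity(aligned, tgt_weights)


# Hypothetical toy data: topic weights for a Korean source and two candidates.
toy_dict = {"경제": ["economy"], "시장": ["market"]}
source_topics = {"경제": 0.6, "시장": 0.4}
faithful_topics = {"economy": 0.6, "market": 0.4}
drifted_topics = {"weather": 1.0}

faithful_score = thematic_consistency(source_topics, faithful_topics, toy_dict)
drifted_score = thematic_consistency(source_topics, drifted_topics, toy_dict)
```

In this toy case the faithful translation shares all of its weighted topic tokens with the aligned source and scores close to 1, while the drifted translation shares none and scores 0. Because the aligned tokens are kept alongside the score, they can also be inspected or visualised to see which themes were preserved.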