Machine Translation (MT) Quality Estimation (QE) assesses translation reliability without reference texts. This study introduces "textual similarity" as a new metric for QE, using sentence transformers and cosine similarity to measure the semantic closeness between source sentences and their machine translations. Analyzing data from the MLQE-PE dataset, we found that textual similarity correlates more strongly with human scores than traditional metrics (HTER, model evaluation, sentence probability, etc.). Employing Generalized Additive Mixed Models (GAMMs) as the statistical tool, we demonstrated that textual similarity consistently outperforms the other metrics in predicting human scores across multiple language pairs. We also found that HTER in fact failed to predict human scores in QE. Our findings highlight the effectiveness of textual similarity as a robust QE metric, and we recommend integrating it, alongside other metrics, into QE frameworks and MT system training to improve accuracy and usability.
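In concrete terms, the textual-similarity metric can be computed by embedding the source sentence and its machine translation with a sentence transformer and taking the cosine similarity of the two embeddings. The following is a minimal sketch assuming the sentence-transformers Python library; the specific multilingual model and the textual_similarity helper are illustrative choices, not necessarily those used in the study.

```python
# Minimal sketch: reference-free textual similarity between a source sentence
# and its machine translation, using a multilingual sentence transformer.
# The model name below is an illustrative assumption.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def textual_similarity(source: str, translation: str) -> float:
    """Cosine similarity between source and MT-output embeddings (no reference needed)."""
    src_emb = model.encode(source, convert_to_tensor=True)
    mt_emb = model.encode(translation, convert_to_tensor=True)
    return util.cos_sim(src_emb, mt_emb).item()

# Example usage: a higher score indicates closer semantic correspondence.
print(textual_similarity("The cat sat on the mat.", "Die Katze saß auf der Matte."))
```

Because the score is computed only from the source and the MT output, it can serve as a QE signal in settings where no reference translation is available.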