Large datasets make it possible to train multilingual machine translation models; however, these models often fail to translate sentences from specialized domains accurately. Although obtaining and translating domain-specific data is costly, it is unavoidable for high-quality translation. Hence, finding the most 'effective' data in an unsupervised setting becomes a practical strategy for reducing labeling costs. Recent research indicates that such data can be found by selecting 'properly difficult' examples, where the appropriate difficulty depends on the amount of data available: examples should be neither excessively challenging nor overly simple, especially when data is limited. However, we found that establishing a criterion for unsupervised data selection remains challenging, because the 'proper difficulty' may vary with the domain of the training data. We introduce a novel unsupervised data selection method, 'Capturing Perplexing Named Entities', which adopts the maximum inference entropy over translated named entities as its selection measure. The motivation is that named entities in domain-specific data are considered the most complex portion of the data and should be predicted with high confidence. When evaluated on the 'Korean-English Parallel Corpus of Specialized Domains', our method provided more robust guidance for unsupervised data selection than existing methods.
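The selection measure described in the abstract, the maximum inference entropy over translated named entities, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the function names (`token_entropy`, `max_entity_entropy`, `select_top_k`) are hypothetical, the per-token output distributions and the named-entity mask are assumed to come from a translation model and an entity tagger that are not shown, and ranking sentences from highest to lowest score is one plausible reading of "capturing perplexing named entities".

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one predicted token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def max_entity_entropy(
    token_dists: Sequence[Sequence[float]],
    entity_mask: Sequence[bool],
) -> float:
    """Largest entropy among output tokens flagged as part of a named entity.

    token_dists[i] is the model's distribution when emitting output token i;
    entity_mask[i] says whether that token belongs to a named entity.
    Returns 0.0 when no token is flagged (an assumption of this sketch).
    """
    scores = [
        token_entropy(dist)
        for dist, is_entity in zip(token_dists, entity_mask)
        if is_entity
    ]
    return max(scores, default=0.0)


def select_top_k(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k highest-scoring sentences (assumed selection rule)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


# Toy example: two translated sentences, each with two output tokens
# over a three-word vocabulary. The numbers are made up.
sentence_a = ([[0.9, 0.05, 0.05], [0.4, 0.3, 0.3]], [False, True])
sentence_b = ([[0.98, 0.01, 0.01], [0.97, 0.02, 0.01]], [True, False])

scores = [max_entity_entropy(dists, mask) for dists, mask in (sentence_a, sentence_b)]
selected = select_top_k(scores, k=1)
```

Taking the maximum rather than the mean means a single uncertain entity token is enough to flag a sentence, which matches the stated motivation that named entities should be predicted with high confidence.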