Pronouns are frequently omitted in pro-drop languages (e.g. Chinese, Hungarian, and Hindi), leaving zero pronouns (ZPs) that must be recovered when translating into non-pro-drop languages (e.g. English). This phenomenon has been studied extensively in machine translation (MT) because determining the correct antecedent of an omitted pronoun is a significant challenge for MT systems. This survey highlights the major work on zero pronoun translation (ZPT) since the neural revolution, so that researchers can recognise the current state and future directions of the field. We organise the literature by evolution, dataset, method, and evaluation. In addition, we compare and analyse competing models and evaluation metrics on different benchmarks. We uncover a number of insightful findings, such as: 1) ZPT is in line with the development trend of large language models; 2) data limitations cause learning bias across languages and domains; 3) performance improvements are often reported on single benchmarks, yet advanced methods are still far from real-world use; 4) general-purpose metrics are not reliable on the nuances and complexities of ZPT, which emphasises the need for targeted metrics; 5) apart from commonly cited errors, ZPs introduce a risk of gender bias.