Assessing performance in Natural Language Processing is becoming increasingly complex. One particular challenge is that evaluation datasets may overlap with training data, either directly or indirectly, which can skew results and lead to overestimation of model performance. As a consequence, human evaluation is attracting growing interest as a means of assessing the performance and reliability of models. One such method is red teaming, which aims to generate edge cases in which a model produces critical errors. While this methodology is becoming standard practice for generative AI, its application to conditional AI remains largely unexplored. This paper presents the first study on human-based red teaming for Machine Translation (MT), a step towards understanding and improving the performance of translation models. We cover both human-based red teaming and a study on automation, report lessons learned, and provide recommendations for both translation models and red teaming drills. This work opens up new avenues for research and development in MT.