This paper evaluates current Large Language Model (LLM) benchmarking for Icelandic, identifies problems, and calls for improved evaluation methods, particularly for low- and medium-resource languages. We show that benchmarks built on synthetic or machine-translated data that has not been verified in any way commonly contain severely flawed test examples that are likely to skew results and undermine the tests' validity. We warn against using such methods without verification in low- and medium-resource settings, since the resulting quality can, at best, only match the machine translation (MT) quality available for a given language at a given time. Indeed, the results of our quantitative error analysis of existing benchmarks for Icelandic show clear differences between human-authored or human-translated benchmarks and synthetic or machine-translated ones.