While new benchmarks for large language models (LLMs) are continuously being developed to keep pace with the growing capabilities of new models and AI in general, using and evaluating LLMs in non-English languages remains a little-charted landscape. We give a concise overview of recent developments in LLM benchmarking, and then propose a new taxonomy for the categorization of benchmarks that is tailored to multilingual or non-English use scenarios. We further propose a set of best practices and quality standards that could lead to a more coordinated development of benchmarks for European languages. Among other recommendations, we advocate for greater sensitivity to language and culture in evaluation methods.