Recent progress in Natural Language Processing (NLP) has been driven by the emergence of Large Language Models (LLMs), which exhibit remarkable generative and reasoning capabilities. However, despite their success, evaluating the true semantic understanding of these models remains a persistent challenge. Traditional benchmarks such as Word-in-Context (WiC) effectively probe this capability, but their creation is resource-intensive and often limited to high-resource languages. In this paper, we introduce SemBench, a framework for automatically generating synthetic benchmarks that assess the semantic competence of LLMs using only dictionary sense definitions and a sentence encoder. This approach eliminates the need for curated example sentences, making it both scalable and language-independent. We evaluate SemBench in three languages (English, Spanish, and Basque) spanning different levels of linguistic resources, and across a wide range of LLMs. Our results show that rankings derived from SemBench strongly correlate with those obtained from standard WiC datasets. Furthermore, our analysis demonstrates that only a small number of examples is required to achieve stable and meaningful rankings. Overall, SemBench provides a lightweight, adaptable, and data-efficient framework for cross-lingual evaluation of semantic understanding in LLMs.
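The claim that SemBench rankings correlate with WiC rankings can be made concrete with a rank-correlation check. The sketch below is only an illustration under stated assumptions: the abstract does not say which correlation statistic is used, so Spearman's rho is assumed here, and the model names and accuracy values are hypothetical placeholders rather than results from the paper.

```python
from typing import Dict, List


def ranks(scores: List[float]) -> List[float]:
    """Return 1-based ranks (highest score gets rank 1); ties share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Extend j over the run of items tied with the item at position i.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(a: List[float], b: List[float]) -> float:
    """Spearman's rho: the Pearson correlation computed on the ranks of a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long score lists with at least 2 items")
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined when all scores are tied")
    return cov / (var_a * var_b) ** 0.5


# Hypothetical accuracies, NOT taken from the paper: one score per model
# on a synthetic benchmark and on a human-curated WiC-style dataset.
synthetic_scores: Dict[str, float] = {"model_a": 0.81, "model_b": 0.74, "model_c": 0.66, "model_d": 0.59}
curated_scores: Dict[str, float] = {"model_a": 0.78, "model_b": 0.69, "model_c": 0.71, "model_d": 0.55}

models = sorted(synthetic_scores)
rho = spearman(
    [synthetic_scores[m] for m in models],
    [curated_scores[m] for m in models],
)
print(f"Spearman rho between the two rankings: {rho:.2f}")
```

A value of rho close to 1 would indicate that the two benchmarks order the models almost identically, which is the sense in which a synthetic benchmark could stand in for a curated one.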