Arabic and Hebrew, as closely related Semitic languages, share a substantial lexicon of true cognates, misleading false friends, and modern loanwords. This overlap poses a challenge for cross-lingual semantic understanding in large language models (LLMs). To evaluate this capability, we introduce SemCog Bench, a curated benchmark of 1,858 Arabic--Hebrew word pairs with sentence-level annotations for cognate identification and semantic disambiguation. We evaluate open-source and commercial LLMs across multiple input representations (raw, diacritized, Romanized, and phonetic) and reveal a critical gap in cross-lingual reasoning. While models achieve high accuracy on true cognates, performance drops sharply on false friends and loanwords, reflecting a strong reliance on surface-form similarity. Furthermore, sentence-level context yields only modest improvements, suggesting that contextual cues alone are insufficient to overcome misleading form-based signals. These findings reveal a fundamental limitation of current LLMs in resolving cross-lingual form--meaning conflicts and establish SemCog Bench as a rigorous benchmark for multilingual semantic reasoning. Our code and data are publicly available.