Figurative language understanding remains a significant challenge for Large Language Models (LLMs), especially in low-resource languages. To address this, we introduce a new idiom dataset: a large-scale, culturally grounded corpus of 10,361 Bengali idioms. Each idiom is annotated under a comprehensive 19-field schema, established and refined through a deliberative expert consensus process, that captures its semantic, syntactic, cultural, and religious dimensions, providing a rich, structured resource for computational linguistics. To establish a robust benchmark for Bengali figurative language understanding, we evaluate 30 state-of-the-art multilingual and instruction-tuned LLMs on the task of inferring figurative meaning. Our results reveal a critical performance gap: no model surpasses 50% accuracy, in stark contrast to substantially higher human performance (83.4%). This underscores the limitations of existing models in cross-linguistic and cultural reasoning. By releasing the dataset and benchmark, we provide foundational infrastructure for advancing figurative language understanding and cultural grounding in LLMs for Bengali and other low-resource languages.