Physical commonsense reasoning is a fundamental capability of human intelligence, enabling individuals to understand their environment, predict future events, and navigate physical spaces. Recent years have witnessed growing interest in reasoning tasks within Natural Language Processing (NLP). However, no prior research has examined the performance of Large Language Models (LLMs) on non-question-answering (non-QA) physical commonsense reasoning tasks in low-resource languages such as Basque. Taking the Italian GITA dataset as a starting point, this paper addresses this gap by presenting BasPhyCo, the first non-QA physical commonsense reasoning dataset for Basque, available in both standard and dialectal variants. We evaluate model performance across three hierarchical levels of commonsense understanding: (1) distinguishing between plausible and implausible narratives (accuracy), (2) identifying the conflicting element that renders a narrative implausible (consistency), and (3) determining the specific physical state that creates the implausibility (verifiability). We assess these tasks with multiple multilingual LLMs as well as models pretrained specifically for Italian and Basque. Results indicate that, in terms of verifiability, LLMs exhibit limited physical commonsense capabilities in low-resource languages such as Basque, especially when processing dialectal variants.
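The three evaluation levels can be illustrated with a minimal sketch. It assumes the levels are nested, so that a narrative counts toward consistency only if its plausibility label is also correct, and toward verifiability only if both earlier levels are correct. The abstract calls the levels hierarchical but does not spell out the exact scoring, so this nesting is an assumption. The names `StoryPrediction` and `tiered_scores` are hypothetical and introduced here for illustration only.

```python
from dataclasses import dataclass


@dataclass
class StoryPrediction:
    """Per-narrative correctness flags (hypothetical structure)."""
    plausibility_correct: bool  # level 1: plausible vs. implausible label is right
    conflict_correct: bool      # level 2: conflicting element is identified
    state_correct: bool         # level 3: responsible physical state is identified


def tiered_scores(predictions):
    """Return accuracy, consistency, and verifiability as fractions of all narratives.

    Assumption: each level requires every earlier level to be correct,
    so verifiability <= consistency <= accuracy.
    """
    total = len(predictions)
    if total == 0:
        raise ValueError("at least one prediction is required")

    accuracy_hits = 0
    consistency_hits = 0
    verifiability_hits = 0
    for p in predictions:
        if not p.plausibility_correct:
            continue  # fails level 1, so it cannot count at higher levels
        accuracy_hits += 1
        if not p.conflict_correct:
            continue  # fails level 2
        consistency_hits += 1
        if p.state_correct:
            verifiability_hits += 1

    return {
        "accuracy": accuracy_hits / total,
        "consistency": consistency_hits / total,
        "verifiability": verifiability_hits / total,
    }


# Toy usage with made-up flags (not results from the paper).
example = [
    StoryPrediction(True, True, True),
    StoryPrediction(True, True, False),
    StoryPrediction(True, False, True),
    StoryPrediction(False, True, True),
]
scores = tiered_scores(example)
```

Under this nesting, a model can score well on accuracy while scoring much lower on verifiability, which is the kind of gap the abstract reports for Basque and its dialectal variants.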