We present COPAL-ID, a novel, public Indonesian-language commonsense reasoning dataset. Unlike the previous Indonesian COPA dataset (XCOPA-ID), COPAL-ID incorporates Indonesian local and cultural nuances and therefore provides a more natural portrayal of day-to-day causal reasoning within the Indonesian cultural sphere. Written professionally from scratch by native speakers, COPAL-ID is more fluent and free of the awkward phrasing found in the translated XCOPA-ID. In addition, we present COPAL-ID in both standard Indonesian and Jakartan Indonesian, a dialect commonly used in daily conversation. COPAL-ID poses a greater challenge for existing open-source and closed state-of-the-art multilingual language models, yet is trivially easy for humans. Our findings suggest that general multilingual models struggle to perform well, achieving 66.91% accuracy on COPAL-ID. Southeast Asian-specific models perform slightly better, at 73.88% accuracy, yet this still falls short of near-perfect human performance. This shows that these language models still lag far behind in comprehending the local nuances of Indonesian.