Recent benchmarks for Large Language Models (LLMs) have mostly focused on application-driven tasks such as complex reasoning and code generation, which has led to a scarcity of purely linguistic evaluation of LLMs. Against this background, we introduce Multilingual Evaluation of Linguistic Acceptability (MELA), the first multilingual benchmark on linguistic acceptability, with 48K samples covering 10 languages from a diverse set of language families. We establish baselines for commonly used LLMs alongside supervised models, and conduct cross-lingual transfer and multi-task learning experiments with XLM-R. In pursuit of multilingual interpretability, we analyze the weights of fine-tuned XLM-R to explore the possibility of identifying transfer difficulty between languages. Our results show that ChatGPT benefits substantially from in-context examples but still lags behind fine-tuned XLM-R, while GPT-4 performs on par with fine-tuned XLM-R even in the zero-shot setting. Cross-lingual and multi-task learning experiments show that, unlike in semantic tasks, in-language training data is crucial for acceptability judgments. Layerwise probing results indicate that the upper layers of XLM-R become a task-specific but language-agnostic region for multilingual acceptability judgment. We also introduce the concept of conflicting weight, which could be a potential indicator of the difficulty of cross-lingual transfer between languages. Our data will be available at https://github.com/sjtu-compling/MELA.