In this work, we present the largest benchmark to date on linguistic acceptability: the Multilingual Evaluation of Linguistic Acceptability -- MELA, with 46K samples covering 10 languages from a diverse set of language families. We establish LLM baselines on this benchmark and investigate cross-lingual transfer in acceptability judgments with XLM-R. In pursuit of multilingual interpretability, we conduct probing experiments with fine-tuned XLM-R to explore the process of syntactic capability acquisition. Our results show that GPT-4o exhibits strong multilingual ability, outperforming fine-tuned XLM-R, while open-source multilingual models lag behind by a noticeable gap. Cross-lingual transfer experiments show that transfer in acceptability judgment is non-trivial: 500 Icelandic fine-tuning examples yield an MCC of 23 on a completely unrelated language -- Chinese. Results of our probing experiments indicate that training on MELA improves the performance of XLM-R on syntax-related tasks. Our data is available at https://github.com/sjtu-compling/MELA.
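The scores above are Matthews correlation coefficients scaled by 100, the standard metric for binary acceptability judgments (as in CoLA). As a minimal illustration of how such a score is computed -- this is not the paper's evaluation code, and the toy labels below are invented -- MCC can be derived directly from the confusion matrix:

```python
import math

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (1 = acceptable, 0 = unacceptable)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is 0 by convention when any marginal count is zero
    return (tp * tn - fp * fn) / denom if denom else 0.0

# hypothetical gold and predicted acceptability judgments
gold = [1, 1, 0, 0, 1, 0, 1, 0]
pred = [1, 0, 0, 1, 1, 0, 1, 0]
print(round(mcc(gold, pred) * 100))  # prints 50, i.e. "50 MCC" in the paper's scaling
```

Unlike accuracy, MCC stays near 0 for a majority-class classifier on imbalanced data, which is why acceptability benchmarks report it.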