Transformer-based models have revolutionized the field of natural language processing. To understand why they perform so well and to assess their reliability, several studies have focused on questions such as: Which linguistic properties are encoded by these models, and to what extent? How robust are these models in encoding linguistic properties when faced with perturbations in the input text? However, these studies have mainly focused on BERT and the English language. In this paper, we investigate similar questions regarding encoding capability and robustness for 8 linguistic properties across 13 different perturbations in 6 Indic languages, using 9 multilingual Transformer models (7 universal and 2 Indic-specific). To conduct this study, we introduce a novel multilingual benchmark dataset, IndicSentEval, containing approximately 47K sentences. Surprisingly, our probing analysis of surface, syntactic, and semantic properties reveals that while almost all multilingual models demonstrate consistent encoding performance for English, they show mixed results for Indic languages. As expected, Indic-specific multilingual models capture linguistic properties in Indic languages better than universal models. Intriguingly, universal models broadly exhibit better robustness than Indic-specific models, particularly under perturbations such as dropping both nouns and verbs, dropping only verbs, or keeping only nouns. Overall, this study provides valuable insights into the probing and perturbation-specific strengths and weaknesses of popular multilingual Transformer-based models for different Indic languages. We make our code and dataset publicly available [https://tinyurl.com/IndicSentEval].