Does Transliteration Help Multilingual Language Modeling? - 专知论文

会员服务 ·

0

语言建模 · 语料库 · 语料 · 基 · 多样性 ·

2023 年 3 月 27 日

Does Transliteration Help Multilingual Language Modeling?

翻译：音译对多语言语言建模有帮助吗？

Ibraheem Muhammad Moosa,Mahmud Elahi Akhter,Ashfia Binte Habib

As there is a scarcity of large representative corpora for most languages, it is important for Multilingual Language Models (MLLM) to extract the most out of existing corpora. In this regard, script diversity presents a challenge to MLLMs by reducing lexical overlap among closely related languages. Therefore, transliterating closely related languages that use different writing scripts to a common script may improve the downstream task performance of MLLMs. In this paper, we pretrain two ALBERT models to empirically measure the effect of transliteration on MLLMs. We specifically focus on the Indo-Aryan language family, which has the highest script diversity in the world. Afterward, we evaluate our models on the IndicGLUE benchmark. We perform Mann-Whitney U test to rigorously verify whether the effect of transliteration is significant or not. We find that transliteration benefits the low-resource languages without negatively affecting the comparatively high-resource languages. We also measure the cross-lingual representation similarity (CLRS) of the models using centered kernel alignment (CKA) on parallel sentences of eight languages from the FLORES-101 dataset. We find that the hidden representations of the transliteration-based model have higher and more stable CLRS scores. Our code is available at Github (github.com/ibraheem-moosa/XLM-Indic) and Hugging Face Hub (huggingface.co/ibraheemmoosa/xlmindic-base-multiscript and huggingface.co/ibraheemmoosa/xlmindic-base-uniscript).

翻译：由于大多数语言缺乏大规模代表性语料库，多语言语言模型（MLLM）必须从现有语料中最大化信息提取效率。在此背景下，文字多样性因降低近亲语言间的词汇重叠度而对MLLM构成挑战。因此，将使用不同书写系统的近亲语言统一音译至同一脚本，可能提升MLLM的下游任务性能。本文预训练了两个ALBERT模型，通过实验量化音译对MLLM的影响，重点关注全球文字多样性最高的印度-雅利安语系。随后，我们在IndicGLUE基准上评估模型，并采用Mann-Whitney U检验严格验证音译效果的显著性。实验表明：音译在未对资源丰富语言产生负面影响的前提下，有效提升了低资源语言的表现。我们进一步利用FLORES-101数据集中的八种语言平行句，通过中心核对齐（CKA）方法测量模型的跨语言表征相似度（CLRS），发现基于音译的模型隐藏层表征具有更高且更稳定的CLRS得分。相关代码已开源至GitHub（github.com/ibraheem-moosa/XLM-Indic）与Hugging Face Hub（huggingface.co/ibraheemmoosa/xlmindic-base-multiscript及huggingface.co/ibraheemmoosa/xlmindic-base-uniscript）。

0

相关内容

语言建模

【ACL2021】ERICA:通过对比学习提高预训练语言模型的实体和关系理解

专知会员服务

26+阅读 · 2021年8月12日

【ICML2020】统一预训练伪掩码语言模型

【ICML2020】统一预训练伪掩码语言模型

专知会员服务

27+阅读 · 2020年7月23日

【知识图谱@ACL2020】Knowledge Graphs in Natural Language Processing

【知识图谱@ACL2020】Knowledge Graphs in Natural Language Processing

专知会员服务

66+阅读 · 2020年7月12日

20篇「ACL2020」最新论文抢先看！看自然语言处理2020在研究什么？

20篇「ACL2020」最新论文抢先看！看自然语言处理2020在研究什么？

专知会员服务

97+阅读 · 2020年4月10日

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

专知会员服务

167+阅读 · 2020年3月18日

BERT到底如何work的？A Primer in BERTology: What we know about how BERT works

BERT到底如何work的？A Primer in BERTology: What we know about how BERT works

专知会员服务

50+阅读 · 2020年2月28日

【跨语言BERT模型大集合】Transfer learning is increasingly going multilingual with language-specific BERT models

专知会员服务

54+阅读 · 2020年1月30日

【微软研究院】IMAGEBERT: CROSS-MODAL PRE-TRAINING WITH LARGE-SCALE WEAK-SUPERVISED IMAGE-TEXT DATA

【微软研究院】IMAGEBERT: CROSS-MODAL PRE-TRAINING WITH LARGE-SCALE WEAK-SUPERVISED IMAGE-TEXT DATA

专知会员服务

43+阅读 · 2020年1月28日

【Google论文强烈推荐】ALBERT:基于精简BERT的自我监督学习的语言表示，ALBERT: A Lite BERT for Self-Supervised Learning of Language Representations

【Google论文强烈推荐】ALBERT:基于精简BERT的自我监督学习的语言表示，ALBERT: A Lite BERT for Self-Supervised Learning of Language Representations

专知会员服务

25+阅读 · 2019年12月21日

【Google论文】ALBERT:自我监督学习语言表达的精简BERT

【Google论文】ALBERT:自我监督学习语言表达的精简BERT

专知会员服务

24+阅读 · 2019年11月4日

BERT/Transformer/迁移学习NLP资源大列表

BERT/Transformer/迁移学习NLP资源大列表

专知

19+阅读 · 2019年6月9日

BERT/注意力机制/Transformer/迁移学习NLP资源大列表：awesome-bert-nlp

BERT/注意力机制/Transformer/迁移学习NLP资源大列表：awesome-bert-nlp

AINLP

40+阅读 · 2019年6月9日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

基于PyTorch/TorchText的自然语言处理库

基于PyTorch/TorchText的自然语言处理库

专知

28+阅读 · 2019年4月22日

Github项目推荐 | awesome-bert：BERT相关资源大列表

Github项目推荐 | awesome-bert：BERT相关资源大列表

AI研习社

27+阅读 · 2019年2月26日

pytorch-pretrained-BERT：BERT PyTorch实现，可加载Google BERT预训练模型

pytorch-pretrained-BERT：BERT PyTorch实现，可加载Google BERT预训练模型

AINLP

35+阅读 · 2018年11月6日

谷歌发表的史上最强NLP模型BERT的官方代码和预训练模型可以下载了

谷歌发表的史上最强NLP模型BERT的官方代码和预训练模型可以下载了

AINLP

12+阅读 · 2018年11月1日

disentangled-representation-papers

disentangled-representation-papers

CreateAMind

26+阅读 · 2018年9月12日

【论文推荐】最新五篇命名实体识别（NER）相关论文—对抗学习、语料库、深度多任务学习、先验知识、跨语言语义

【论文推荐】最新五篇命名实体识别（NER）相关论文—对抗学习、语料库、深度多任务学习、先验知识、跨语言语义

专知

37+阅读 · 2018年2月21日

【推荐】自然语言处理（NLP）指南

【推荐】自然语言处理（NLP）指南

机器学习研究会

35+阅读 · 2017年11月17日

VSMC源性血管紧张素转化酶通过C-结构域促动脉粥样硬化的作用及机制研究

国家自然科学基金

0+阅读 · 2013年12月31日

拟南芥DIF（DRIP1-Interacting Factor）在胁迫信号应答中的功能分析

国家自然科学基金

0+阅读 · 2012年12月31日

解构近边吸收谱

国家自然科学基金

0+阅读 · 2012年12月31日

拟南芥光敏色素A的蛋白磷酸化调控其信号传导的分子机理研究

国家自然科学基金

0+阅读 · 2012年12月31日

双语者句子理解过程中句法加工的认知/神经时间动态性

国家自然科学基金

0+阅读 · 2012年12月31日

NUAK1介导LKB1信号通路与PTEN信号通路相互作用的分子机制

国家自然科学基金

0+阅读 · 2012年12月31日

线粒体tRNA前体加工与细胞周期调控之间偶联机制的研究

国家自然科学基金

0+阅读 · 2011年12月31日

基于语言理解的机器翻译方法研究

国家自然科学基金

2+阅读 · 2009年12月31日

基于潜在语义对偶空间的跨语言信息检索理论和算法研究

国家自然科学基金

1+阅读 · 2009年12月31日

句子语义的视觉表示研究

国家自然科学基金

4+阅读 · 2009年12月31日

Chain-of-Dictionary Prompting Elicits Translation in Large Language Models

Arxiv

0+阅读 · 2023年5月17日

How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model

Arxiv

0+阅读 · 2023年5月16日

What Makes Pre-trained Language Models Better Zero-shot Learners?

Arxiv

0+阅读 · 2023年5月16日

A Hierarchical Encoding-Decoding Scheme for Abstractive Multi-document Summarization

Arxiv

0+阅读 · 2023年5月16日

Soft Prompt Decoding for Multilingual Dense Retrieval

Arxiv

0+阅读 · 2023年5月15日

MultiTACRED: A Multilingual Version of the TAC Relation Extraction Dataset

Arxiv

0+阅读 · 2023年5月15日

Towards Transliteration between Sindhi Scripts from Devanagari to Perso-Arabic

Arxiv

0+阅读 · 2023年5月12日

GLUE-X: Evaluating Natural Language Understanding Models from an Out-of-distribution Generalization Perspective

Arxiv

0+阅读 · 2023年5月12日

Causal Inference in Natural Language Processing: Estimation, Prediction, Interpretation and Beyond

Arxiv

21+阅读 · 2021年9月2日

TinyBERT: Distilling BERT for Natural Language Understanding

TinyBERT: Distilling BERT for Natural Language Understanding

Arxiv

11+阅读 · 2019年9月23日

VIP会员

文章信息

相关主题

最新内容

《越野作战环境下路径规划的多准则整数规划模型》

《越野作战环境下路径规划的多准则整数规划模型》

专知会员服务

2+阅读 · 12分钟前

人工智能大语言模型引擎如何重塑全球冲突信息环境最新50页

人工智能大语言模型引擎如何重塑全球冲突信息环境最新50页

专知会员服务

2+阅读 · 18分钟前

《防空系统对自主武器系统辩论中“有意义的人类控制”的启示》70页报告

《防空系统对自主武器系统辩论中“有意义的人类控制”的启示》70页报告

专知会员服务

2+阅读 · 25分钟前

“对标ChatGPT”：乌军研发Marichka AI系统用于战场筹划

“对标ChatGPT”：乌军研发Marichka AI系统用于战场筹划

专知会员服务

2+阅读 · 29分钟前

《同步多无人机系统中的故障与通信》

《同步多无人机系统中的故障与通信》

专知会员服务

2+阅读 · 今天6:23

论文解读 | 医学图像修复中的扩散模型：挑战、分类与未来方向

论文解读 | 医学图像修复中的扩散模型：挑战、分类与未来方向

专知会员服务

2+阅读 · 7月28日

博士论文 | 从算法到基础模型：强化学习的统一视角

博士论文 | 从算法到基础模型：强化学习的统一视角

专知会员服务

6+阅读 · 7月28日

面向国防作战的最佳自主与蜂群无人机技术

面向国防作战的最佳自主与蜂群无人机技术

专知会员服务

7+阅读 · 7月28日

《异构人类团队的协作决策过程混合建模研究》

《异构人类团队的协作决策过程混合建模研究》

专知会员服务

7+阅读 · 7月28日

《C5ISR系统中的注意力动态与自适应决策支持研究：视觉与多模态注意力引导对任务绩效影响的递归量化分析》最新36页报告

《C5ISR系统中的注意力动态与自适应决策支持研究：视觉与多模态注意力引导对任务绩效影响的递归量化分析》最新36页报告

专知会员服务

8+阅读 · 7月28日

《设计思维中的人机协作：生成式人工智能对共情访谈影响的探究》140页

《设计思维中的人机协作：生成式人工智能对共情访谈影响的探究》140页

专知会员服务

9+阅读 · 7月28日

博士论文 | 面向大模型推理的内存高效算法

博士论文 | 面向大模型推理的内存高效算法

专知会员服务

5+阅读 · 7月27日

论文解读 | 从预训练到后训练：理解大模型推理能力如何形成

论文解读 | 从预训练到后训练：理解大模型推理能力如何形成

专知会员服务

10+阅读 · 7月27日

《无人系统互操作性导论——无人系统联合架构（JAUS）》

《无人系统互操作性导论——无人系统联合架构（JAUS）》

专知会员服务

14+阅读 · 7月27日

美空军新型反无人机部队初探

美空军新型反无人机部队初探

专知会员服务

9+阅读 · 7月27日

相关VIP内容

【ACL2021】ERICA:通过对比学习提高预训练语言模型的实体和关系理解

专知会员服务

26+阅读 · 2021年8月12日

【ICML2020】统一预训练伪掩码语言模型

【ICML2020】统一预训练伪掩码语言模型

专知会员服务

27+阅读 · 2020年7月23日

【知识图谱@ACL2020】Knowledge Graphs in Natural Language Processing

【知识图谱@ACL2020】Knowledge Graphs in Natural Language Processing

专知会员服务

66+阅读 · 2020年7月12日

20篇「ACL2020」最新论文抢先看！看自然语言处理2020在研究什么？

20篇「ACL2020」最新论文抢先看！看自然语言处理2020在研究什么？

专知会员服务

97+阅读 · 2020年4月10日

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

专知会员服务

167+阅读 · 2020年3月18日

BERT到底如何work的？A Primer in BERTology: What we know about how BERT works

BERT到底如何work的？A Primer in BERTology: What we know about how BERT works

专知会员服务

50+阅读 · 2020年2月28日

【跨语言BERT模型大集合】Transfer learning is increasingly going multilingual with language-specific BERT models

专知会员服务

54+阅读 · 2020年1月30日

【微软研究院】IMAGEBERT: CROSS-MODAL PRE-TRAINING WITH LARGE-SCALE WEAK-SUPERVISED IMAGE-TEXT DATA

【微软研究院】IMAGEBERT: CROSS-MODAL PRE-TRAINING WITH LARGE-SCALE WEAK-SUPERVISED IMAGE-TEXT DATA

专知会员服务

43+阅读 · 2020年1月28日

【Google论文强烈推荐】ALBERT:基于精简BERT的自我监督学习的语言表示，ALBERT: A Lite BERT for Self-Supervised Learning of Language Representations

【Google论文强烈推荐】ALBERT:基于精简BERT的自我监督学习的语言表示，ALBERT: A Lite BERT for Self-Supervised Learning of Language Representations

专知会员服务

25+阅读 · 2019年12月21日

【Google论文】ALBERT:自我监督学习语言表达的精简BERT

【Google论文】ALBERT:自我监督学习语言表达的精简BERT

专知会员服务

24+阅读 · 2019年11月4日

热门VIP内容

开通专知VIP会员享更多权益服务

《防空系统对自主武器系统辩论中“有意义的人类控制”的启示》70页报告

《同步多无人机系统中的故障与通信》

人工智能大语言模型引擎如何重塑全球冲突信息环境最新50页

“对标ChatGPT”：乌军研发Marichka AI系统用于战场筹划

相关资讯

BERT/Transformer/迁移学习NLP资源大列表

BERT/Transformer/迁移学习NLP资源大列表

专知

19+阅读 · 2019年6月9日

BERT/注意力机制/Transformer/迁移学习NLP资源大列表：awesome-bert-nlp

BERT/注意力机制/Transformer/迁移学习NLP资源大列表：awesome-bert-nlp

AINLP

40+阅读 · 2019年6月9日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

基于PyTorch/TorchText的自然语言处理库

基于PyTorch/TorchText的自然语言处理库

专知

28+阅读 · 2019年4月22日

Github项目推荐 | awesome-bert：BERT相关资源大列表

Github项目推荐 | awesome-bert：BERT相关资源大列表

AI研习社

27+阅读 · 2019年2月26日

pytorch-pretrained-BERT：BERT PyTorch实现，可加载Google BERT预训练模型

pytorch-pretrained-BERT：BERT PyTorch实现，可加载Google BERT预训练模型

AINLP

35+阅读 · 2018年11月6日

谷歌发表的史上最强NLP模型BERT的官方代码和预训练模型可以下载了

谷歌发表的史上最强NLP模型BERT的官方代码和预训练模型可以下载了

AINLP

12+阅读 · 2018年11月1日

disentangled-representation-papers

disentangled-representation-papers

CreateAMind

26+阅读 · 2018年9月12日

【论文推荐】最新五篇命名实体识别（NER）相关论文—对抗学习、语料库、深度多任务学习、先验知识、跨语言语义

【论文推荐】最新五篇命名实体识别（NER）相关论文—对抗学习、语料库、深度多任务学习、先验知识、跨语言语义

专知

37+阅读 · 2018年2月21日

【推荐】自然语言处理（NLP）指南

【推荐】自然语言处理（NLP）指南

机器学习研究会

35+阅读 · 2017年11月17日

相关论文

Chain-of-Dictionary Prompting Elicits Translation in Large Language Models

Arxiv

0+阅读 · 2023年5月17日

How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model

Arxiv

0+阅读 · 2023年5月16日

What Makes Pre-trained Language Models Better Zero-shot Learners?

Arxiv

0+阅读 · 2023年5月16日

A Hierarchical Encoding-Decoding Scheme for Abstractive Multi-document Summarization

Arxiv

0+阅读 · 2023年5月16日

Soft Prompt Decoding for Multilingual Dense Retrieval

Arxiv

0+阅读 · 2023年5月15日

MultiTACRED: A Multilingual Version of the TAC Relation Extraction Dataset

Arxiv

0+阅读 · 2023年5月15日

Towards Transliteration between Sindhi Scripts from Devanagari to Perso-Arabic

Arxiv

0+阅读 · 2023年5月12日

GLUE-X: Evaluating Natural Language Understanding Models from an Out-of-distribution Generalization Perspective

Arxiv

0+阅读 · 2023年5月12日

Causal Inference in Natural Language Processing: Estimation, Prediction, Interpretation and Beyond

Arxiv

21+阅读 · 2021年9月2日

TinyBERT: Distilling BERT for Natural Language Understanding

TinyBERT: Distilling BERT for Natural Language Understanding

Arxiv

11+阅读 · 2019年9月23日

相关基金

VSMC源性血管紧张素转化酶通过C-结构域促动脉粥样硬化的作用及机制研究

国家自然科学基金

0+阅读 · 2013年12月31日

拟南芥DIF（DRIP1-Interacting Factor）在胁迫信号应答中的功能分析

国家自然科学基金

0+阅读 · 2012年12月31日

解构近边吸收谱

国家自然科学基金

0+阅读 · 2012年12月31日

拟南芥光敏色素A的蛋白磷酸化调控其信号传导的分子机理研究

国家自然科学基金

0+阅读 · 2012年12月31日

双语者句子理解过程中句法加工的认知/神经时间动态性

国家自然科学基金

0+阅读 · 2012年12月31日

NUAK1介导LKB1信号通路与PTEN信号通路相互作用的分子机制

国家自然科学基金

0+阅读 · 2012年12月31日

线粒体tRNA前体加工与细胞周期调控之间偶联机制的研究

国家自然科学基金

0+阅读 · 2011年12月31日

基于语言理解的机器翻译方法研究

国家自然科学基金

2+阅读 · 2009年12月31日

基于潜在语义对偶空间的跨语言信息检索理论和算法研究

国家自然科学基金

1+阅读 · 2009年12月31日

句子语义的视觉表示研究

国家自然科学基金

4+阅读 · 2009年12月31日

微信扫码咨询专知VIP会员