Bi-Phone: Modeling Inter Language Phonetic Influences in Text

Because of technology asymmetries, a large number of people are forced to use the Web in a language in which they have low literacy. Text written in the second language (L2) by such users often contains many errors influenced by their native language (L1). We propose a method to mine phoneme confusions (sounds in L2 that an L1 speaker is likely to conflate) for pairs of L1 and L2. These confusions are then plugged into a generative model (Bi-Phone) that synthetically produces corrupted L2 text. Through human evaluations, we show that Bi-Phone generates plausible corruptions that differ across L1s and that have widespread coverage on the Web. We also corrupt the popular language understanding benchmark SuperGLUE with our technique (FunGLUE, for Phonetically Noised GLUE) and show that SoTA language understanding models perform poorly on it. In addition, we introduce a new phoneme prediction pre-training task that helps byte models recover performance close to their SuperGLUE performance. Finally, we release the FunGLUE benchmark to promote further research on phonetically robust language models. To the best of our knowledge, FunGLUE is the first benchmark to introduce L1-L2 interactions in text.
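To make the idea of plugging confusions into a generator concrete, here is a minimal toy sketch in Python. It is not the authors' implementation: the abstract gives no details of the Bi-Phone model, so everything below is an assumption made for illustration. In particular, the table `TOY_CONFUSIONS`, its probabilities, and the function `corrupt` are hypothetical, and the sketch substitutes letter sequences directly instead of working with real phonemes.

```python
import random

# Hypothetical confusion table (made-up values, not mined from data).
# Each key is a letter sequence standing in for an L2 sound; each value
# lists (replacement, probability) pairs an L1 speaker might produce.
TOY_CONFUSIONS = {
    "th": [("t", 0.4), ("d", 0.2)],
    "v": [("w", 0.3)],
    "z": [("j", 0.25)],
}


def corrupt(text, confusions, rng):
    """Return a copy of text with confusable units randomly substituted.

    Units are matched longest-first at each position. For a matched unit,
    one replacement is sampled according to its probability; any leftover
    probability mass keeps the unit unchanged.
    """
    units = sorted(confusions, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for unit in units:
            if text.startswith(unit, i):
                replacement = unit
                draw = rng.random()
                cumulative = 0.0
                for candidate, prob in confusions[unit]:
                    cumulative += prob
                    if draw < cumulative:
                        replacement = candidate
                        break
                out.append(replacement)
                i += len(unit)
                break
        else:
            # No confusable unit starts here; copy the character as-is.
            out.append(text[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(corrupt("the very lazy thing", TOY_CONFUSIONS, rng))
```

A different L1 would be modeled in this sketch simply by swapping in a different confusion table, which mirrors the abstract's claim that the generated corruptions differ across L1s.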

