As code-mixed data becomes more widespread, there is a growing need to handle it well. Such data poses several challenges: spelling variations, multiple languages, different scripts, and a lack of resources. Current language models struggle with code-mixed data because they focus primarily on the semantic representation of words and ignore their auditory (phonetic) features, which makes them sensitive to spelling variations in code-mixed text. In this paper, we propose an approach for building language models for code-mixed text that uses auditory information about words obtained from SOUNDEX. Our approach consists of a masked language modeling pre-training step that incorporates SOUNDEX representations (SAMLM) and a new method of providing input data to the pre-trained model. Through experiments on several code-mixed datasets in different languages, covering sentiment, offensive-language, and aggression classification tasks, we show that SAMLM improves robustness to adversarial attacks on code-mixed classification tasks. The SAMLM-based approach also achieves better classification results than popular baselines for code-mixed tasks. We use the explainability technique SHAP (SHapley Additive exPlanations) to explain how the auditory features incorporated through SAMLM help the model handle code-mixed text effectively and increase its robustness against adversarial attacks.\footnote{Source code is available at \url{https://github.com/20118/DefenseWithPhonetics} and \url{https://www.iitp.ac.in/~ai-nlp-ml/resources.html\#Phonetics}.}
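To illustrate why a phonetic code can absorb spelling variation, the following is a minimal sketch of the classic American Soundex algorithm. It is not the authors' implementation, and the paper may use a different Soundex variant; the function name `soundex` and the romanized Hindi example words are assumptions chosen for illustration.

```python
# Minimal sketch of classic American Soundex (illustrative only; not the
# authors' code, and the variant used for SAMLM may differ).

# Consonant groups that share a digit; vowels, Y, H and W have no digit.
_CODES = {}
for _letters, _digit in (
    ("BFPV", "1"),
    ("CGJKQSXZ", "2"),
    ("DT", "3"),
    ("L", "4"),
    ("MN", "5"),
    ("R", "6"),
):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word: str) -> str:
    """Return the 4-character Soundex code of `word` ('' if it has no A-Z letters)."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = _CODES.get(first, "")  # the first letter's digit also suppresses repeats
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not break a run of identical digits
        code = _CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated digit after it is kept
    # Pad with zeros and truncate to one letter plus three digits.
    return (first + "".join(digits) + "000")[:4]


if __name__ == "__main__":
    # Hypothetical romanized spellings of the same Hindi word.
    for variant in ("bahut", "bohot", "bhot"):
        print(variant, soundex(variant))
```

In this sketch the three spelling variants map to the same code, which is the property that makes a phonetic representation useful as an additional signal alongside the surface tokens.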