Social media data has been of interest to Natural Language Processing (NLP) practitioners for over a decade, both for its richness in information and for the challenges it poses for automatic processing. Because language use on social media is more informal and spontaneous and adheres to many different sociolects, the performance of NLP models often deteriorates. One solution to this problem is to transform the data to a standard variant before processing it, a task known as lexical normalization (e.g., rewriting "u r gr8" as "you are great"). A wide variety of benchmarks and models has been proposed for this task. The MultiLexNorm benchmark was proposed to unify these efforts, but it consists almost exclusively of languages from the Indo-European language family written in the Latin script. Hence, we propose an extension to MultiLexNorm that covers 5 Asian languages from different language families, written in 4 different scripts. We show that the previous state-of-the-art model performs worse on the new languages and propose a new architecture based on Large Language Models (LLMs), which shows more robust performance. Finally, we analyze the remaining errors, revealing future directions for this task.