Text augmentation is an important task for low-resource languages because it helps address data scarcity. Over the years, much work has been done on data augmentation for English. In contrast, very little work has been done on Indian languages, even though data augmentation is meant precisely for settings where data is scarce. In this work, we implement techniques such as Easy Data Augmentation, Back Translation, Paraphrasing, Text Generation using LLMs, and Text Expansion using LLMs for text classification across different languages. We focus on six Indian languages: Sindhi, Marathi, Hindi, Gujarati, Telugu, and Sanskrit. To the best of our knowledge, no such work exists for text augmentation on Indian languages. We carry out both binary and multi-class text classification to make our results more comparable. Surprisingly, we find that basic data augmentation techniques surpass LLMs.