Online abusive content detection, particularly in low-resource settings and in the audio modality, remains underexplored. We investigate the potential of pre-trained audio representations for detecting abusive language in low-resource languages, here Indian languages, using Few-Shot Learning (FSL). Leveraging powerful representations from models such as Wav2Vec and Whisper, we explore cross-lingual abuse detection on the ADIMA dataset with FSL. Our approach integrates these representations within the Model-Agnostic Meta-Learning (MAML) framework to classify abusive language in 10 languages. We experiment with various shot sizes (50-200) to evaluate the impact of limited data on performance. Additionally, we conduct a feature-visualization study to better understand model behaviour. This study highlights the generalization ability of pre-trained models in low-resource scenarios and offers valuable insights into detecting abusive language in multilingual contexts.
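To make the described pipeline concrete, the following is a minimal first-order MAML sketch in NumPy: each "language" is a task, episodes of fixed-dimensional audio embeddings (standing in for Wav2Vec/Whisper features) are split into support and query sets, a logistic-regression head is adapted in the inner loop, and the meta-parameters are updated across tasks. All names, dimensions, and the synthetic data are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 16  # stand-in for the audio embedding size (hypothetical)

def make_episode(shift, n_shot=50):
    # Synthetic "language": non-abusive/abusive clusters, shifted per task.
    X0 = rng.normal(-1 + shift, 1.0, size=(n_shot, DIM))
    X1 = rng.normal(+1 + shift, 1.0, size=(n_shot, DIM))
    X = np.vstack([X0, X1])
    y = np.array([0] * n_shot + [1] * n_shot)
    return X, y

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad(w, X, y):
    # Gradient of mean binary cross-entropy for logistic regression.
    p = sigmoid(X @ w)
    return X.T @ (p - y) / len(y)

def maml_train(n_tasks=8, meta_steps=100, inner_lr=0.1,
               meta_lr=0.05, inner_steps=3):
    w = np.zeros(DIM)
    shifts = rng.normal(0, 0.5, size=n_tasks)  # one shift per "language"
    for _ in range(meta_steps):
        meta_grad = np.zeros(DIM)
        for s in shifts:
            Xs, ys = make_episode(s)  # support set
            Xq, yq = make_episode(s)  # query set
            w_task = w.copy()
            for _ in range(inner_steps):  # inner-loop adaptation
                w_task -= inner_lr * grad(w_task, Xs, ys)
            # First-order meta-gradient (FOMAML approximation).
            meta_grad += grad(w_task, Xq, yq)
        w -= meta_lr * meta_grad / n_tasks
    return w

w_meta = maml_train()
# Few-shot adaptation to an unseen "language".
Xs, ys = make_episode(shift=0.3)
w_new = w_meta - 0.1 * grad(w_meta, Xs, ys)
Xq, yq = make_episode(shift=0.3)
acc = float(np.mean((sigmoid(Xq @ w_new) > 0.5) == yq))
```

In practice the linear head would sit on top of frozen Wav2Vec or Whisper embeddings, and the full (second-order) MAML update would differentiate through the inner loop; the first-order variant shown here is a common lightweight approximation.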