This paper describes our multiclass classification system developed as part of the LTEDI@RANLP-2023 shared task. We used a BERT-based language model to detect homophobic and transphobic content in social media comments across five language conditions: English, Spanish, Hindi, Malayalam, and Tamil. We retrained a transformer-based cross-lingual pretrained language model, XLM-RoBERTa, with spatially and temporally relevant social media language data. We also retrained a subset of models with simulated script-mixed social media language data, which yielded varied performance. Our seven-label classification system for Malayalam was the best performing by weighted macro-averaged F1 score (ranked first out of six); performance varied for the other language and class-label conditions. We found that including this spatio-temporal data improved classification performance for all language and task conditions when compared with the baseline. The results suggest that transformer-based language classification systems are sensitive to register-specific and language-specific retraining.