This paper describes the multiclass classification system we developed for the LT-EDI@RANLP-2023 shared task. We used a BERT-based language model to detect homophobic and transphobic content in social media comments across five language conditions: English, Spanish, Hindi, Malayalam, and Tamil. We retrained a transformer-based cross-lingual pretrained language model, XLM-RoBERTa, on spatially and temporally relevant social media language data. We also retrained a subset of models on simulated script-mixed social media language data, with mixed results. Our seven-label classification system for Malayalam was the best-performing submission by weighted macro-averaged F1 score (ranked first out of six); performance varied for the other language and class-label conditions. Including the spatio-temporal data improved classification performance over the baseline for all language and task conditions. The results suggest that transformer-based language classification systems are sensitive to register-specific and language-specific retraining.