Despite large language models (LLMs) being known to exhibit bias against non-mainstream varieties, there are no known labeled datasets for sentiment analysis of varieties of English. To address this gap, we introduce BESSTIE, a benchmark for sentiment and sarcasm classification for three varieties of English: Australian (en-AU), Indian (en-IN), and British (en-UK). Using web-based content from two domains, namely Google Place reviews and Reddit comments, we collect datasets for these language varieties using two methods: location-based and topic-based filtering. Native speakers of the language varieties manually annotate the datasets with sentiment and sarcasm labels. Subsequently, we fine-tune nine LLMs (representing a range of encoder/decoder and mono/multilingual models) on these datasets, and evaluate their performance on the two tasks. Our results reveal that the models consistently perform better on inner-circle varieties (i.e., en-AU and en-UK), with significant performance drops for en-IN, particularly in sarcasm detection. We also report challenges in cross-variety generalisation, highlighting the need for language variety-specific datasets such as ours. BESSTIE promises to be a useful evaluative benchmark for future research in equitable LLMs, specifically in terms of language varieties. The BESSTIE datasets, code, and models are currently available on request, while the paper is under review. Please email aditya.joshi@unsw.edu.au.