Far Out: Evaluating Language Models on Slang in Australian and Indian English

Language models exhibit systematic performance gaps when processing text in non-standard language varieties, yet their ability to comprehend variety-specific slang remains underexplored for several languages. We present a comprehensive evaluation of slang awareness in Indian English (en-IN) and Australian English (en-AU) across seven state-of-the-art language models. We construct two complementary datasets: WEB, containing 377 web-sourced usage examples from Urban Dictionary, and GEN, featuring 1,492 synthetically generated usages of these slang terms across diverse scenarios. We assess language models on three tasks: target word prediction (TWP), guided target word prediction (TWP$^*$), and target word selection (TWS). Our results reveal the following key findings: (1) average model performance is higher on TWS than on TWP and TWP$^*$, with average accuracy rising from 0.03 to 0.49; (2) average model performance is stronger on WEB than on GEN, with average similarity scores higher by 0.03 and 0.05 on TWP and TWP$^*$, respectively; (3) models perform better on en-IN than on en-AU when averaged across all models and datasets, with TWS showing the largest disparity, where average accuracy rises from 0.44 (en-AU) to 0.54 (en-IN). These findings underscore fundamental asymmetries between generative and discriminative competencies for variety-specific language, particularly for slang expressions, even in a technologically rich language such as English.
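As a rough illustration of how an accuracy score for a task such as TWS could be computed, the sketch below scores model outputs by exact match against the gold slang term. The abstract does not specify the item format or the matching rule, so the `Item` fields, the `<mask>` placeholder, the `normalize` helper, and the example terms are all hypothetical assumptions rather than the authors' setup.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """One hypothetical evaluation item (field names are assumptions)."""
    masked_text: str    # usage example with the slang term replaced by a mask
    target: str         # gold slang term
    options: list[str]  # candidate terms offered to the model (selection task only)


def normalize(term: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not count as errors."""
    return " ".join(term.lower().split())


def accuracy(predictions: list[str], items: list[Item]) -> float:
    """Fraction of predictions that exactly match the gold term after normalization."""
    if len(predictions) != len(items):
        raise ValueError("predictions and items must have the same length")
    if not items:
        return 0.0
    hits = sum(normalize(p) == normalize(it.target) for p, it in zip(predictions, items))
    return hits / len(items)


# Illustrative items only; these are not drawn from the WEB or GEN datasets.
items = [
    Item("We had a <mask> at the beach this arvo.", "barbie", ["barbie", "prepone", "timepass"]),
    Item("Can we <mask> the meeting to Monday?", "prepone", ["barbie", "prepone", "timepass"]),
]
score = accuracy(["Barbie", "timepass"], items)  # one of the two predictions matches
```

Exact match suits a selection task, where the model picks from a fixed option list. The similarity scores reported for TWP and TWP$^*$ would need a different, graded metric, which the abstract does not describe.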

