Language models exhibit systematic performance gaps when processing text in non-standard language varieties, yet their ability to comprehend variety-specific slang remains underexplored for several languages. We present a comprehensive evaluation of slang awareness in Indian English (en-IN) and Australian English (en-AU) across seven state-of-the-art language models. We construct two complementary datasets: WEB, containing 377 web-sourced usage examples from Urban Dictionary, and GEN, featuring 1,492 synthetically generated usages of these slang terms across diverse scenarios. We assess language models on three tasks: target word prediction (TWP), guided target word prediction (TWP$^*$), and target word selection (TWS). Our results reveal the following key findings: (1) average model performance is higher on TWS than on TWP and TWP$^*$, with average accuracy increasing from 0.03 to 0.49; (2) average model performance is stronger on WEB than on GEN, with average similarity scores higher by 0.03 and 0.05 on the TWP and TWP$^*$ tasks, respectively; (3) en-IN tasks outperform en-AU when averaged across all models and datasets, with TWS showing the largest disparity: average accuracy rises from 0.44 (en-AU) to 0.54 (en-IN). These findings underscore fundamental asymmetries between generative and discriminative competencies for variety-specific language, particularly for slang expressions, even in a technologically rich language such as English.