Language models exhibit systematic performance gaps when processing text in non-standard language varieties, yet their ability to comprehend variety-specific slang remains underexplored for several languages. We present a comprehensive evaluation of slang awareness in Indian English (en-IN) and Australian English (en-AU) across seven state-of-the-art language models. We construct two complementary datasets: \textsc{web}, containing 377 web-sourced usage examples from Urban Dictionary, and \textsc{gen}, containing 1,492 synthetically generated usages of these slang terms across diverse scenarios. We assess language models on three tasks: target word prediction (TWP), guided target word prediction (TWP$^*$), and target word selection (TWS). Our results reveal three key findings: (1) average model performance is higher on TWS than on TWP and TWP$^*$, with average accuracy rising from 0.03 to 0.49; (2) average model performance is stronger on \textsc{web} than on \textsc{gen}, with the average similarity score higher by 0.03 and 0.05 on the TWP and TWP$^*$ tasks, respectively; and (3) models perform better on en-IN than on en-AU tasks when averaged across all models and datasets, with TWS showing the largest disparity: average accuracy is 0.44 for en-AU versus 0.54 for en-IN. These findings underscore fundamental asymmetries between generative and discriminative competencies for variety-specific language, particularly for slang expressions, even in a technologically well-resourced language such as English.