Far Out: Evaluating Language Models on Slang in Australian and Indian English

Language models exhibit systematic performance gaps when processing text in non-standard language varieties, yet their ability to comprehend variety-specific slang remains underexplored for several languages. We present a comprehensive evaluation of slang awareness in Indian English (en-IN) and Australian English (en-AU) across seven state-of-the-art language models. We construct two complementary datasets: \textsc{web}, containing 377 web-sourced usage examples from Urban Dictionary, and \textsc{gen}, containing 1,492 synthetically generated usages of the same slang terms across diverse scenarios. We assess language models on three tasks: target word prediction (TWP), guided target word prediction (TWP$^*$), and target word selection (TWS). Our results reveal the following key findings: (1) average model performance is higher on TWS than on TWP and TWP$^*$, with average accuracy rising from 0.03 to 0.49; (2) average model performance is stronger on \textsc{web} than on \textsc{gen}, with the average similarity score higher by 0.03 on TWP and by 0.05 on TWP$^*$; (3) models perform better on en-IN than on en-AU when averaged across all models and datasets, with TWS showing the largest disparity: average accuracy is 0.44 for en-AU and 0.54 for en-IN. These findings underscore fundamental asymmetries between generative and discriminative competencies for variety-specific language, particularly for slang expressions, even though English is a technologically rich language.

