Specialized lexicons are collections of words with associated constraints such as special definitions, specific roles, and intended target audiences. These constraints are necessary for content generation and documentation tasks (e.g., writing technical manuals or children's reading materials), where the goal is to reduce the ambiguity of text content and increase its overall readability for a specific target audience. Understanding how well large language models can capture these constraints can help researchers build better, more impactful tools for wider use beyond the NLP community. Towards this end, we introduce SpeciaLex, a benchmark for evaluating a language model's ability to follow specialized lexicon-based constraints across 18 diverse subtasks with 1,785 test instances covering the core tasks of Checking, Identification, Rewriting, and Open Generation. We present an empirical evaluation of 15 open- and closed-source LLMs and discuss insights on how factors such as model scale, openness, setup, and recency affect performance on the benchmark.