The labor market is changing rapidly, prompting increased interest in the automatic extraction of occupational skills from text. With the advent of English benchmark job description datasets, there is a need for systems that handle their diversity well. We tackle the complexity of occupational skill datasets along three lines: combining and leveraging multiple datasets for skill extraction, identifying rarely observed skills within a dataset, and overcoming the scarcity of skills across datasets. In particular, we investigate the retrieval augmentation of language models, employing an external datastore for retrieving similar skills in a dataset-unifying manner. Our proposed method, \textbf{N}earest \textbf{N}eighbor \textbf{O}ccupational \textbf{S}kill \textbf{E}xtraction (NNOSE), effectively leverages multiple datasets by retrieving neighboring skills from other datasets in the datastore. This improves skill extraction \emph{without} additional fine-tuning. Crucially, we observe a performance gain in predicting infrequent patterns, with substantial gains of up to 30\% span-F1 in cross-dataset settings.
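To make the retrieval step concrete, the following is a minimal sketch of one way nearest-neighbor retrieval from a datastore can adjust a tagger's predictions without fine-tuning. It assumes a kNN-LM-style interpolation between the model's label distribution and a distribution built from the retrieved neighbors; the text above does not spell out this formula, so the interpolation weight `lam`, the `temperature`, the BIO label set, the toy two-dimensional embeddings, and all function names are illustrative assumptions rather than the method's actual implementation.

```python
import math

LABELS = ["B", "I", "O"]  # assumed BIO tagging scheme for skill spans


def sq_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_distribution(query, datastore, k=2, temperature=1.0):
    """Label distribution from the k datastore entries nearest to `query`.

    `datastore` is a list of (embedding, label) pairs, which may come from
    several datasets at once. Closer neighbors receive larger weights.
    """
    neighbors = sorted(datastore, key=lambda entry: sq_l2(query, entry[0]))[:k]
    weights = [math.exp(-sq_l2(query, emb) / temperature) for emb, _ in neighbors]
    total = sum(weights)
    dist = {label: 0.0 for label in LABELS}
    for (_, label), weight in zip(neighbors, weights):
        dist[label] += weight / total
    return dist


def interpolate(p_model, p_knn, lam=0.3):
    """Mix the model and kNN distributions; `lam` is a hypothetical weight."""
    return {label: lam * p_knn[label] + (1 - lam) * p_model[label]
            for label in p_model}


# Toy datastore: token embeddings with gold labels (made-up values).
datastore = [
    ([1.0, 0.0], "B"),
    ([0.9, 0.1], "B"),
    ([0.0, 1.0], "O"),
    ([0.1, 0.9], "I"),
]

query = [0.95, 0.05]                       # embedding of the token to tag
p_model = {"B": 0.4, "I": 0.1, "O": 0.5}   # model alone would predict "O"

p_knn = knn_distribution(query, datastore, k=2)
p_final = interpolate(p_model, p_knn, lam=0.3)
prediction = max(p_final, key=p_final.get)
```

In this toy case both retrieved neighbors carry the label `B`, so the interpolated distribution shifts the prediction from `O` to `B` while the model's weights stay untouched. A real system would likely replace the exhaustive sort with an approximate nearest-neighbor index.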