Online job ads are a valuable source of information on skill requirements and play a crucial role in labor market analysis and e-recruitment processes. Since such ads are typically written as free text, natural language processing (NLP) technologies are required to process them automatically. We focus specifically on the task of detecting skills (mentioned literally or described implicitly) and linking them to a large skill ontology, which makes this a challenging case of extreme multi-label classification (XMLC). Because no sizable labeled (training) dataset is available for this specific XMLC task, we propose techniques that leverage general Large Language Models (LLMs). We describe a cost-effective approach to generate an accurate, fully synthetic labeled dataset for skill extraction, and present a contrastive learning strategy that proves effective for the task. Our results across three skill extraction benchmarks show a consistent increase of 15 to 25 percentage points in \textit{R-Precision@5} compared to previously published results that relied solely on distant supervision through literal matches.
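The abstract reports gains in \textit{R-Precision@5} without defining the metric. The following is a minimal sketch, assuming the definition commonly used in XMLC evaluation: for each input, the number of gold labels found among the top $K$ predictions, divided by $\min(K, R)$, where $R$ is the number of gold labels for that input, averaged over all inputs. The function names and the toy skill labels are hypothetical and are not taken from the paper.

```python
from typing import Sequence, Set


def r_precision_at_k(ranked: Sequence[str], gold: Set[str], k: int = 5) -> float:
    """RP@K for one input: hits in the top-k divided by min(k, number of gold labels).

    Assumes `ranked` lists unique labels, best prediction first.
    """
    if not gold:
        raise ValueError("gold label set must be non-empty")
    hits = sum(1 for label in ranked[:k] if label in gold)
    return hits / min(k, len(gold))


def mean_r_precision_at_k(
    rankings: Sequence[Sequence[str]], golds: Sequence[Set[str]], k: int = 5
) -> float:
    """Average RP@K over a collection of inputs."""
    if len(rankings) != len(golds) or not rankings:
        raise ValueError("rankings and golds must be non-empty and of equal length")
    scores = [r_precision_at_k(r, g, k) for r, g in zip(rankings, golds)]
    return sum(scores) / len(scores)


# Toy example with made-up labels: two job-ad sentences and their ranked predictions.
rankings = [
    ["python", "sql", "teamwork", "java", "excel", "docker"],  # "docker" falls outside the top 5
    ["negotiation", "sales", "crm"],
]
golds = [{"python", "docker"}, {"negotiation"}]

score = mean_r_precision_at_k(rankings, golds, k=5)
```

Dividing by $\min(K, R)$ rather than by $K$ means an input with fewer than $K$ gold skills can still reach a perfect score, which matters here because a single job-ad sentence usually maps to only a few ontology entries.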