Understanding labour market dynamics requires accurately identifying the skills that jobs require and that the workforce possesses. Automation techniques are increasingly being developed to support this effort. However, automatically extracting skills from job postings is challenging because of the vast number of existing skills. The ESCO (European Skills, Competences, Qualifications and Occupations) framework provides a useful reference, listing over 13,000 individual skills. Even with this reference, skills extraction remains difficult, and accurately matching job posts to the ESCO taxonomy is an open problem. In this work, we propose an end-to-end zero-shot system for skills extraction from job descriptions based on large language models (LLMs). We generate synthetic training data for the entirety of ESCO skills and train a classifier to extract skill mentions from job posts. We also employ a similarity retriever to generate skill candidates, which are then re-ranked using a second LLM. Using synthetic data achieves an RP@10 score 10 points higher than previous distant supervision approaches. Adding GPT-4 re-ranking improves RP@10 by over 22 points over previous methods. We also show that framing the task as mock programming when prompting the LLM can lead to better performance than natural language prompts, especially with weaker LLMs. We demonstrate the potential of integrating large language models at both ends of skills matching pipelines. Our approach requires no human annotations and achieves promising results on skills extraction against ESCO.
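The abstract reports results in RP@10 without defining it. As a minimal sketch, assuming the common definition of R-Precision at K (for each example, the number of gold skills found in the top K predictions, divided by the smaller of K and the number of gold skills, then averaged over examples), the metric could be computed as follows. The function name `rp_at_k` and the toy skill labels are illustrative assumptions, not taken from the paper.

```python
from typing import Sequence, Set


def rp_at_k(ranked: Sequence[Sequence[str]],
            gold: Sequence[Set[str]],
            k: int = 10) -> float:
    """Mean R-Precision@k (assumed definition, see lead-in).

    ranked: one ranked list of predicted skill labels per example.
    gold:   one set of gold skill labels per example.
    Examples with no gold labels are skipped.
    """
    scores = []
    for preds, labels in zip(ranked, gold):
        if not labels:
            continue
        # Count distinct gold labels that appear in the top-k predictions.
        hits = len(set(preds[:k]) & labels)
        # Normalise by min(k, |gold|) so a perfect ranking scores 1.0.
        scores.append(hits / min(k, len(labels)))
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Hypothetical toy data: two job-post sentences with ranked candidates.
    ranked = [["python", "sql", "teamwork"], ["welding", "excel"]]
    gold = [{"python", "teamwork"}, {"excel", "budgeting", "negotiation"}]
    print(rp_at_k(ranked, gold, k=10))
```

Normalising by the smaller of K and the number of gold skills means an example with fewer than K gold skills can still reach a perfect score, which matters here because a single job-post sentence typically maps to only a few ESCO skills.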