Contrastive Bi-Encoder Models for Multi-Label Skill Extraction: Enhancing ESCO Ontology Matching with BERT and Attention Mechanisms

Fine-grained labor market analysis increasingly relies on mapping unstructured job advertisements to standardized skill taxonomies such as ESCO. This mapping is naturally formulated as an Extreme Multi-Label Classification (XMLC) problem, but supervised solutions are constrained by the scarcity and cost of large-scale, taxonomy-aligned annotations, especially in non-English settings where job-ad language diverges substantially from formal skill definitions. We propose a zero-shot skill extraction framework that eliminates the need for manually labeled job-ad training data. The framework uses a Large Language Model (LLM) to synthesize training instances from ESCO definitions, and introduces hierarchically constrained multi-skill generation based on ESCO Level-2 categories to improve semantic coherence in multi-label contexts. On top of the synthetic corpus, we train a contrastive bi-encoder that aligns job-ad sentences with ESCO skill descriptions in a shared embedding space; the encoder augments a BERT backbone with BiLSTM and attention pooling to better model long, information-dense requirement statements. An upstream RoBERTa-based binary filter removes non-skill sentences to improve end-to-end precision. Experiments show that (i) hierarchy-conditioned generation improves both fluency and discriminability relative to unconstrained pairing, and (ii) the resulting multi-label model transfers effectively to real-world Chinese job advertisements, achieving strong zero-shot retrieval performance (F1@5 = 0.72) and outperforming TF-IDF and standard BERT baselines. Overall, the proposed pipeline provides a scalable, data-efficient pathway for automated skill coding in labor economics and workforce analytics.
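
To make the retrieval step and the F1@5 figure concrete, the following is a minimal sketch of how a job-ad sentence embedding might be matched against skill embeddings and scored. It is an illustration under stated assumptions, not the implementation used here: the abstract does not specify how F1@k is computed or averaged, so the per-sentence definition below (precision and recall over the top-k retrieved skills) is one common choice. The function names, the toy two-dimensional vectors, and the skill labels are all hypothetical; in the actual pipeline the vectors would come from the trained bi-encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_skills(sentence_vec, skill_vecs, k=5):
    """Rank skill labels by similarity to the sentence and keep the best k.

    skill_vecs maps a (hypothetical) skill label to its embedding.
    """
    ranked = sorted(
        skill_vecs,
        key=lambda label: cosine(sentence_vec, skill_vecs[label]),
        reverse=True,
    )
    return ranked[:k]


def f1_at_k(predicted, gold, k=5):
    """F1 over the top-k predictions for one sentence (assumed definition)."""
    top = predicted[:k]
    hits = len(set(top) & set(gold))
    if hits == 0:
        return 0.0
    precision = hits / len(top)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy example with made-up 2-d embeddings.
skills = {
    "programming": [1.0, 0.0],
    "data analysis": [0.8, 0.6],
    "negotiation": [0.0, 1.0],
}
sentence = [0.9, 0.1]

retrieved = top_k_skills(sentence, skills, k=2)
score = f1_at_k(retrieved, {"programming"}, k=2)
```

In this toy case the two retrieved labels contain the single gold skill, so precision is 1/2 and recall is 1. A corpus-level figure would then be obtained by averaging the per-sentence scores, which is again an assumption about the evaluation protocol.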
