Matching job descriptions (JDs) with suitable talent requires models capable of understanding not only textual similarities between JDs and candidate resumes but also contextual factors such as geographical location and academic seniority. To address this challenge, we propose a two-stage training framework for large language models (LLMs). In the first stage, a contrastive learning approach is used to train the model on a dataset constructed from real-world matching rules, such as geographical alignment and research area overlap. While effective, this model primarily learns the patterns defined by the matching rules. In the second stage, we introduce a novel preference-based fine-tuning method inspired by Direct Preference Optimization (DPO), termed Rank Preference Optimization (RankPO), to align the model with AI-curated pairwise preferences emphasizing textual understanding. Our experiments show that while the first-stage model achieves strong performance on rule-based data (nDCG@20 = 0.706), it lacks robust textual understanding (alignment with AI annotations = 0.46). By fine-tuning with RankPO, we achieve a balanced model that retains relatively good performance on the original task while significantly improving alignment with AI preferences. The code and data are available at https://github.com/yflyzhang/RankPO.
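To make the second stage concrete, the following is a minimal, hypothetical sketch of a DPO-style pairwise preference loss applied to ranker scores. The exact RankPO formulation is not given in this abstract and may differ; the function name, the use of raw similarity scores in place of log-probabilities, and the `beta` temperature are illustrative assumptions only.

```python
import math

def dpo_style_pairwise_loss(score_chosen: float, score_rejected: float,
                            ref_chosen: float, ref_rejected: float,
                            beta: float = 0.1) -> float:
    """Illustrative DPO-style preference loss (not the exact RankPO objective).

    Encourages the fine-tuned model to score the AI-preferred (chosen)
    candidate above the rejected one, measured relative to a frozen
    reference model so the policy does not drift too far from stage one.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # candidate than the reference model does, minus the same for the rejected.
    margin = beta * ((score_chosen - ref_chosen) - (score_rejected - ref_rejected))
    # Negative log-sigmoid of the margin: small when the preference is satisfied.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

As in DPO, the loss vanishes as the margin grows and equals log 2 when the policy and reference agree, so gradient pressure concentrates on pairs where the model still ranks candidates contrary to the AI-curated preference.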