PJB: A Reasoning-Aware Benchmark for Person-Job Retrieval

As retrieval models converge on generic benchmarks, the pressing question is no longer "who scores higher" but "where do systems fail, and why?" Person-job matching urgently needs this kind of diagnostic capability: systems must not only verify explicit constraints but also perform skill-transfer inference and job-competency reasoning, yet existing benchmarks provide no systematic diagnostic support for the task. We introduce PJB (Person-Job Benchmark), a reasoning-aware retrieval evaluation dataset. PJB uses complete job descriptions as queries and complete resumes as documents, defines relevance through job-competency judgment, and is grounded in real-world recruitment data spanning six industry domains and nearly 200,000 resumes. Its domain-family and reasoning-type diagnostic labels shift evaluation from "who scores higher" to "where do systems differ, and why." Diagnostic experiments with dense retrieval show that performance variation across industry domains far exceeds the gains from module upgrades to the same model, indicating that aggregate scores alone can severely mislead optimization decisions. At the module level, reranking yields stable improvements, whereas query understanding not only fails to help but degrades overall performance when combined with reranking; the two modules face fundamentally different improvement bottlenecks. The value of PJB lies not in yet another leaderboard of average scores but in giving recruitment retrieval systems a capability map that pinpoints where to invest.
