We introduce JobResQA, a multilingual Question Answering benchmark for evaluating the Machine Reading Comprehension (MRC) capabilities of LLMs on HR-specific tasks involving résumés and job descriptions. The dataset comprises 581 QA pairs across 105 synthetic résumé-job description pairs in five languages (English, Spanish, Italian, German, and Chinese), with questions spanning three complexity levels, from basic factual extraction to complex cross-document reasoning. We propose a data generation pipeline derived from real-world sources through de-identification and data synthesis to ensure both realism and privacy, while controlled demographic and professional attributes (implemented via placeholders) enable systematic bias and fairness studies. We also present a cost-effective, human-in-the-loop translation pipeline based on the TEaR methodology, incorporating MQM error annotations and selective post-editing to ensure a high-quality multi-way parallel benchmark. We provide baseline evaluations across multiple open-weight LLM families using an LLM-as-judge approach, revealing higher performance on English and Spanish but substantial degradation for the other languages, highlighting critical gaps in multilingual MRC capabilities for HR applications. JobResQA provides a reproducible benchmark for advancing fair and reliable LLM-based HR systems. The benchmark is publicly available at: https://github.com/Avature/jobresqa-benchmark