Mathematical Framework for Custom Reward Functions in Job Application Evaluation using Reinforcement Learning

Most traditional Applicant Tracking Systems (ATS) rely on strict keyword matching, so highly qualified candidates are often rejected because of minor semantic differences. This article introduces a two-stage process for building a more comprehensive resume assessment system on top of a small language model with fewer than 600M parameters, fine-tuned using GRPO with a custom-designed reward function. The first stage is Supervised Fine-Tuning (SFT), which produces a strong base model able to understand resumes beyond superficial keyword overlap. In the second stage, this SFT model is further optimized with Reinforcement Learning (RL) via GRPO, using a multi-component reward that does not rely on simple token matching. In the initial RL experiments, we encountered a severe reward hacking problem: overly aggressive penalty terms led to unstable training dynamics and extremely negative model behavior. We resolved this through iterative refinement of the reward function and careful tuning of the training hyperparameters, which led to a stable and controlled process of gentle polishing. The GRPO-refined model performs well in practice, reaching 91% accuracy on unseen test data. On the SELECTED class it achieves a recall of 0.85 with a perfect precision of 1.0, which highlights its reliability in identifying qualified applicants. These findings demonstrate that an appropriately structured two-stage fine-tuning pipeline can turn a small language model into a human-like candidate evaluator, overcoming the shortcomings of both traditional ATS and unrefined applications of reinforcement learning.
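To make the idea of a multi-component reward with a bounded penalty more concrete, the following is a minimal sketch and not the reward function used in this work. The component names (`label_correct`, `format_ok`), the weights, and the `penalty_cap` parameter are all hypothetical, chosen only for illustration. The second function shows the group-relative normalization that GRPO applies to the rewards of completions sampled for the same prompt; the use of the population standard deviation and the value of `eps` are likewise assumptions.

```python
from statistics import mean, pstdev


def combined_reward(label_correct: bool, format_ok: bool,
                    penalty: float, penalty_cap: float = 0.5) -> float:
    """Hypothetical multi-component reward (illustrative weights only).

    Adds a correctness term and a small format bonus, then subtracts a
    penalty that is clamped to [0, penalty_cap] so that a single bad
    completion cannot dominate the training signal.
    """
    reward = 1.0 if label_correct else 0.0
    reward += 0.2 if format_ok else 0.0
    reward -= min(max(penalty, 0.0), penalty_cap)
    return reward


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of completions for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four hypothetical completions sampled for a single resume prompt.
    group = [
        combined_reward(True, True, 0.0),
        combined_reward(True, False, 0.1),
        combined_reward(False, True, 0.0),
        combined_reward(False, False, 10.0),  # raw penalty is clamped to the cap
    ]
    print(group_relative_advantages(group))
```

Clamping the penalty is one simple way to keep an aggressive penalty term from destabilizing training, which is the kind of failure described above; the actual remedy in this work may differ.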
