Labor economists regularly analyze employment data by fitting predictive models to small, carefully constructed longitudinal survey datasets. Although machine learning methods offer promise for such problems, these survey datasets are too small to take advantage of them. In recent years, large datasets of online resumes have also become available, providing data about the career trajectories of millions of individuals. However, standard econometric models cannot take advantage of their scale or incorporate them into the analysis of survey data. To this end, we develop CAREER, a foundation model for job sequences. CAREER is first fit to large, passively collected resume data and then fine-tuned to smaller, better-curated datasets for economic inferences. We fit CAREER to a dataset of 24 million job sequences from resumes and fine-tune it on small longitudinal survey datasets. We find that CAREER forms accurate predictions of job sequences, outperforming econometric baselines on three widely used economics datasets. We further find that CAREER can be used to form good predictions of other downstream variables. For example, incorporating CAREER into a wage model provides better predictions than the econometric models currently in use.