This paper introduces OccCANINE, a tool that automatically transforms occupational descriptions into codes of the HISCO classification system. Manually processing and classifying occupational descriptions is error-prone, tedious, and time-consuming. We fine-tune a pre-trained language model (CANINE) to do this automatically, performing in seconds or minutes what previously took days or weeks. The model is trained on 14 million pairs of occupational descriptions and HISCO codes in 13 different languages, contributed by 22 different sources. Our approach achieves accuracy, recall, and precision above 90 percent. The tool breaks the metaphorical HISCO barrier and makes these data readily available for analysis of occupational structures, with broad applicability in economics, economic history, and related disciplines.