This paper introduces OccCANINE, a new tool that automatically converts occupational descriptions into codes from the HISCO classification system. Manually processing and classifying occupational descriptions is error-prone, tedious, and time-consuming. We fine-tune a pretrained language model (CANINE) to perform this task automatically, accomplishing in seconds or minutes what previously took days or weeks. The model is trained on 14 million pairs of occupational descriptions and HISCO codes in 13 languages, contributed by 22 sources. Our approach achieves accuracy, recall, and precision above 90 percent. The tool breaks the metaphorical HISCO barrier and makes this data readily available for the analysis of occupational structures, with broad applicability in economics, economic history, and related disciplines.
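To make the reported metrics concrete, the following is a minimal sketch of how accuracy, precision, and recall could be computed when each occupational description may map to one or more HISCO codes. The abstract does not define the metrics, so the set-based, per-observation averaging used here is an assumption, and the function name `set_metrics` and the example codes are hypothetical placeholders rather than part of OccCANINE.

```python
from typing import Iterable, Set, Tuple


def set_metrics(
    true_codes: Iterable[Set[str]],
    predicted_codes: Iterable[Set[str]],
) -> Tuple[float, float, float]:
    """Return (accuracy, precision, recall), each averaged over observations.

    Assumed definitions (not taken from the paper):
      accuracy  = share of observations whose predicted set equals the true set
      precision = share of predicted codes that are correct, per observation
      recall    = share of true codes that were predicted, per observation
    """
    exact = precision = recall = 0.0
    n = 0
    for truth, pred in zip(true_codes, predicted_codes):
        n += 1
        overlap = len(truth & pred)
        exact += float(truth == pred)
        # An empty prediction or empty truth set contributes 0 to that metric.
        precision += overlap / len(pred) if pred else 0.0
        recall += overlap / len(truth) if truth else 0.0
    if n == 0:
        raise ValueError("at least one observation is required")
    return exact / n, precision / n, recall / n


# Illustrative placeholder codes, not verified HISCO assignments.
truth = [{"11111"}, {"11111", "22222"}, {"33333"}]
preds = [{"11111"}, {"11111"}, {"44444"}]
accuracy, precision, recall = set_metrics(truth, preds)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

Averaging per observation keeps descriptions with several codes from dominating the score; pooling all codes before dividing (micro-averaging) is an equally defensible choice and can give different numbers.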