In this work, we develop a pipeline for historical-psychological text analysis in classical Chinese. Humans have produced texts in various languages for thousands of years, yet most of the computational literature focuses on contemporary languages and corpora. The emerging field of historical psychology uses recently developed natural language processing (NLP) methods to extract aspects of psychology from historical corpora. Our pipeline, called Contextualized Construct Representations (CCR), combines expert knowledge in psychometrics (i.e., psychological surveys) with text representations generated by transformer-based language models to measure psychological constructs such as traditionalism, norm strength, and collectivism in classical Chinese corpora. Because available data are scarce, we propose an indirect supervised contrastive learning approach and build the first Chinese historical psychology corpus (C-HI-PSY) to fine-tune pre-trained models. In our evaluation, CCR outperforms word-embedding-based approaches across all of our tasks and exceeds prompting with GPT-4 in most tasks. Finally, we benchmark the pipeline against objective, external data to further verify its validity.
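To make the core idea concrete, the following is a minimal sketch of how a construct score might be computed once a passage and a set of survey items have been embedded. It assumes the score is the mean cosine similarity between the passage embedding and the item embeddings; the abstract does not state the scoring rule, so this choice, the function names `cosine` and `construct_score`, and the three-dimensional toy vectors are all illustrative assumptions. In practice the vectors would come from a transformer-based encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def construct_score(text_vec, item_vecs):
    """Mean cosine similarity between one passage embedding and the
    embeddings of the survey items for a construct (assumed scoring rule)."""
    if not item_vecs:
        raise ValueError("at least one survey item embedding is required")
    return sum(cosine(text_vec, item) for item in item_vecs) / len(item_vecs)


# Toy stand-ins for encoder outputs (hypothetical values, not real embeddings).
item_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # two survey items
passage_a = [1.0, 1.0, 0.0]  # lies close to both items
passage_b = [0.0, 0.0, 1.0]  # orthogonal to both items

score_a = construct_score(passage_a, item_vecs)
score_b = construct_score(passage_b, item_vecs)
print(score_a > score_b)  # passage_a loads more strongly on the construct
```

Averaging over items mirrors how a questionnaire scale aggregates item responses, but other aggregation rules would fit the same description equally well.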