Personality refers to individual differences in behavior, thinking, and feeling. With the growing availability of digital footprints, especially from social media, automated methods for personality assessment have become increasingly important. Natural language processing (NLP) enables the analysis of unstructured text to identify personality indicators. However, two challenges remain, and both are central to this thesis: the scarcity of large, personality-labeled datasets, and the disconnect between personality psychology and NLP, which restricts model validity and interpretability. To address these challenges, this thesis presents two datasets, MBTI9k and PANDORA, collected from Reddit, a platform known for user anonymity and diverse discussions. The PANDORA dataset contains 17 million comments from over 10,000 users and combines MBTI and Big Five personality labels with demographic information, overcoming limitations in data size, quality, and label coverage. Experiments on these datasets show that demographic variables influence model validity. In response, the thesis develops SIMPA (Statement-to-Item Matching Personality Assessment), a computational framework for interpretable personality assessment that matches user-generated statements with validated questionnaire items. By using machine learning and semantic similarity, SIMPA delivers personality assessments comparable to human evaluations while maintaining high interpretability and efficiency. Although developed for personality assessment, SIMPA is applicable beyond this domain. Its model-agnostic design, layered cue detection, and scalability make it suitable for research and practical applications that involve complex label taxonomies and variable cue associations with target concepts.
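The statement-to-item matching idea behind SIMPA can be illustrated with a minimal sketch. This is not the thesis's implementation: it assumes a simple bag-of-words cosine similarity as a stand-in for the semantic similarity component, and the function names, the threshold value, and the example items and statements are all hypothetical, made up for illustration rather than taken from a validated questionnaire or from the PANDORA data.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_statements_to_items(statements, items, threshold=0.3):
    """Pair each statement with its most similar item.

    Statements whose best score falls below `threshold` (an arbitrary,
    assumed cutoff) are left unmatched. Returns a list of
    (statement, item_id, score) tuples.
    """
    item_vecs = {item_id: Counter(tokenize(text)) for item_id, text in items.items()}
    matches = []
    for statement in statements:
        vec = Counter(tokenize(statement))
        best_id, best_score = None, 0.0
        for item_id, item_vec in item_vecs.items():
            score = cosine(vec, item_vec)
            if score > best_score:
                best_id, best_score = item_id, score
        if best_id is not None and best_score >= threshold:
            matches.append((statement, best_id, best_score))
    return matches


# Hypothetical questionnaire-style items and user statements (illustrative only).
ITEMS = {
    "E1": "I like going to parties",
    "C1": "I keep my things tidy",
}
STATEMENTS = [
    "I really like going to big parties",
    "The weather was cold yesterday",
]

matches = match_statements_to_items(STATEMENTS, ITEMS)
for statement, item_id, score in matches:
    print(f"{item_id}  {score:.2f}  {statement}")
```

Because every match points back to a specific questionnaire item, the resulting assessment can be inspected statement by statement, which is the interpretability property the abstract emphasizes. A realistic system would presumably replace the word-overlap similarity with a learned semantic representation.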