A Computational Framework for Interpretable Text-Based Personality Assessment from Social Media

Personality refers to individual differences in behavior, thinking, and feeling. With the growing availability of digital footprints, especially from social media, automated methods for personality assessment have become increasingly important. Natural language processing (NLP) enables the analysis of unstructured text data to identify personality indicators. However, two challenges are central to this thesis: the scarcity of large, personality-labeled datasets, and the disconnect between personality psychology and NLP, which restricts model validity and interpretability. To address these challenges, this thesis presents two datasets, MBTI9k and PANDORA, both collected from Reddit, a platform known for user anonymity and diverse discussions. The PANDORA dataset contains 17 million comments from over 10,000 users and integrates the MBTI and Big Five personality models with demographic information, overcoming limitations in data size, quality, and label coverage. Experiments on these datasets show that demographic variables influence model validity. In response, the thesis develops SIMPA (Statement-to-Item Matching Personality Assessment), a computational framework for interpretable personality assessment that matches user-generated statements with validated questionnaire items. By using machine learning and semantic similarity, SIMPA delivers personality assessments comparable to human evaluations while maintaining high interpretability and efficiency. Although focused on personality assessment, SIMPA's versatility extends beyond this domain. Its model-agnostic design, layered cue detection, and scalability make it suitable for various research and practical applications involving complex label taxonomies and variable cue associations with target concepts.
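
The statement-to-item matching idea can be illustrated with a minimal sketch. This is not the SIMPA implementation: the questionnaire items below are hypothetical examples, and a plain bag-of-words cosine similarity stands in for whatever semantic similarity model the thesis actually uses. The names `ITEMS`, `bag_of_words`, `cosine`, and `match_statement` are assumptions made for illustration only.

```python
import math
import re
from collections import Counter

# Hypothetical questionnaire items (illustrative only, not taken from the thesis).
ITEMS = {
    "extraversion": "I enjoy talking to lots of people at parties",
    "conscientiousness": "I always finish my work on time and stay organized",
}


def bag_of_words(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match_statement(statement, items=ITEMS):
    """Return (trait, item, score) for the questionnaire item most similar to the statement."""
    vec = bag_of_words(statement)
    scored = [
        (trait, item, cosine(vec, bag_of_words(item)))
        for trait, item in items.items()
    ]
    return max(scored, key=lambda entry: entry[2])


if __name__ == "__main__":
    trait, item, score = match_statement("I love talking to people at parties")
    print(trait, round(score, 3), "<-", item)
```

Because each prediction points back to a specific questionnaire item, a match of this kind can be inspected directly, which is the sense of interpretability the abstract describes. A realistic system would presumably replace the word-count vectors with sentence embeddings.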
