Life trajectories of notable people carry essential information for human dynamics research. These trajectories consist of (\textit{person, time, location, activity type}) tuples recording when and where a person was born, went to school, started a job, or fought in a war. However, current studies cover only a limited set of activity types, such as births and deaths, and lack large-scale fine-grained trajectories. Using a tool that extracts (\textit{person, time, location}) triples from Wikipedia, we formulate the problem of classifying these triples into 24 carefully defined types, using textual context as complementary information. The challenge is that the entities of a triple are often scattered across noisy contexts. We use syntactic graphs to bring the triple entities and relevant information closer together, and fuse these graphs with text embeddings to classify life trajectory activities. Since the quality of Wikipedia text varies, we use LLMs to refine the text and obtain more standardized syntactic graphs. Our framework achieves 84.5\% accuracy, surpassing baselines. We construct the largest fine-grained life trajectory dataset, with 3.8 million labeled activities for 589,193 individuals spanning three centuries. Finally, we showcase how these trajectories can support grand narratives of human dynamics across time and space. The code and data are publicly available.
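As a rough illustration of the data model described above, the following minimal Python sketch represents an extracted (person, time, location) triple and the labeled four-element tuple obtained once an activity type is attached. All names here (`Triple`, `Activity`, `label`, `trajectory`) are hypothetical, time is assumed to be an integer year, and the activity-type strings are illustrative stand-ins rather than the paper's actual 24-type label set. The classifier itself is not modeled; the label is simply passed in.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Triple:
    """An extracted (person, time, location) triple (hypothetical schema)."""
    person: str
    time: int       # assumption: time is reduced to a year
    location: str


@dataclass(frozen=True)
class Activity:
    """A labeled (person, time, location, activity type) tuple."""
    person: str
    time: int
    location: str
    activity_type: str  # illustrative label, not the paper's taxonomy


def label(triple: Triple, activity_type: str) -> Activity:
    """Attach an activity type (e.g. produced by a classifier) to a triple."""
    return Activity(triple.person, triple.time, triple.location, activity_type)


def trajectory(activities: Iterable[Activity], person: str) -> List[Activity]:
    """Return one person's activities in chronological order."""
    return sorted(
        (a for a in activities if a.person == person),
        key=lambda a: a.time,
    )


# Placeholder records; these are not taken from the dataset.
records = [
    label(Triple("Person A", 1920, "City Y"), "education"),
    label(Triple("Person B", 1915, "City Z"), "birth"),
    label(Triple("Person A", 1901, "City X"), "birth"),
]
```

Calling `trajectory(records, "Person A")` groups the labeled tuples by person and orders them by year, which is the sense in which a set of labeled activities forms a life trajectory.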