Using Sequences of Life-events to Predict Human Lives

Over the past decade, machine learning has revolutionized computers' ability to analyze text through flexible computational models. Because many other kinds of data share the sequential structure of written language, transformer-based architectures have also shown promise as tools for making sense of a range of multivariate sequences, from protein structures, music, and electronic health records to weather forecasts. Human lives can be represented in a way that shares this structural similarity to language. From one perspective, lives are simply sequences of events: people are born, visit the pediatrician, start school, move to a new location, get married, and so on. Here, we exploit this similarity to adapt innovations from natural language processing to examine the evolution and predictability of human lives based on detailed event sequences. We do this by drawing on arguably the most comprehensive registry data in existence, available for an entire nation of more than six million individuals across decades. Our data include information about life-events related to health, education, occupation, income, address, and working hours, recorded with day-to-day resolution. We create embeddings of life-events in a single vector space and show that this embedding space is robust and highly structured. Our models allow us to predict diverse outcomes ranging from early mortality to personality nuances, outperforming state-of-the-art models by a wide margin. Using methods for interpreting deep learning models, we probe the algorithm to understand the factors that enable our predictions. Our framework allows researchers to identify new potential mechanisms that impact life outcomes, along with associated possibilities for personalized interventions.
