The ability to synthesize realistic data in a parametrizable way is valuable for a number of reasons, including privacy, missing-data imputation, and evaluating the performance of statistical and computational methods. When the underlying data-generating process is complex, data synthesis requires approaches that balance realism and simplicity. In this paper, we address the problem of synthesizing sequential categorical data of the type that is increasingly available from mobile applications and sensors that record participant status continuously over the course of multiple days and weeks. We propose the paired Markov Chain (paired-MC) method, a flexible framework that produces sequences that closely mimic real data while providing a straightforward mechanism for modifying characteristics of the synthesized sequences. We demonstrate the paired-MC method on two datasets: one reflecting daily human activity patterns collected via a smartphone application, and one encoding the intensities of physical activity measured by wearable accelerometers. In both settings, sequences synthesized by paired-MC capture key characteristics of the real data better than alternative approaches.