Digital footprints (records of individuals' interactions with digital systems) are essential for studying behavior, developing personalized applications, and training machine learning models. However, research in this area is often hindered by the scarcity of diverse and accessible data. To address this limitation, we propose a novel method for synthesizing realistic digital footprints using large language model (LLM) agents. Starting from a structured user profile, our approach generates diverse and plausible sequences of user events, and then produces the corresponding digital artifacts, such as emails, messages, calendar entries, and reminders. Intrinsic evaluation shows that the generated dataset is more diverse and realistic than those produced by existing baselines. Moreover, models fine-tuned on our synthetic data outperform models trained on other synthetic datasets when evaluated on real-world out-of-distribution tasks.