This paper introduces Mayfly, a federated analytics approach that enables aggregate queries over ephemeral on-device data streams without persisting sensitive user data centrally. Mayfly minimizes data through SQL-programmable on-device windowing and contribution bounding, anonymizes user data via streaming differential privacy (DP), and mandates immediate in-memory cross-device aggregation on the server, ensuring that only privatized aggregates are revealed to data analysts. Deployed for a sustainability use case that estimates transportation carbon emissions from private location data, Mayfly computed over 4 million statistics across more than 500 million devices with a per-device, per-week DP $\varepsilon = 2$ while meeting strict data utility requirements. To achieve this, we designed a new DP mechanism for Group-By-Sum workloads that leverages statistical properties of location data and is potentially applicable to other domains.
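To make the pipeline in the abstract concrete, the following is a minimal sketch of per-device contribution bounding followed by a differentially private Group-By-Sum. It is not the new mechanism the paper designs: it substitutes the textbook Laplace mechanism as a baseline, and it does not model windowing or streaming. The function names (`bound_contribution`, `laplace_noise`, `dp_group_by_sum`), the parameters `max_groups` and `max_value`, and the assumption of a fixed public list of groups are all hypothetical choices made for illustration.

```python
import random
from collections import defaultdict


def bound_contribution(records, groups, max_groups, max_value):
    """Bound one device's contribution to a Group-By-Sum query.

    records: iterable of (group, value) pairs from one device.
    Returns at most `max_groups` per-group sums, each clamped to
    the range [0, max_value]. (Hypothetical helper, not from the paper.)
    """
    allowed = set(groups)
    per_group = defaultdict(float)
    for group, value in records:
        if group in allowed:  # drop keys outside the public group domain
            per_group[group] += value
    # Deterministic choice of which groups to keep; sampling is another option.
    kept = sorted(per_group.items())[:max_groups]
    return {g: min(max(v, 0.0), max_value) for g, v in kept}


def laplace_noise(rng, scale):
    """Draw a Laplace(0, scale) sample as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_group_by_sum(device_reports, groups, max_groups, max_value, epsilon, rng=None):
    """Baseline Laplace mechanism for Group-By-Sum (assumed, not Mayfly's).

    Each device affects at most `max_groups` sums by at most `max_value`
    each, so the L1 sensitivity is max_groups * max_value.
    """
    # random.Random is for illustration only; a deployment would need a
    # cryptographically secure noise source.
    rng = rng or random.Random()
    totals = {g: 0.0 for g in groups}  # every public group gets a (noisy) sum
    for records in device_reports:
        bounded = bound_contribution(records, groups, max_groups, max_value)
        for g, v in bounded.items():
            totals[g] += v  # in-memory aggregation; raw records are not stored
    scale = max_groups * max_value / epsilon
    return {g: total + laplace_noise(rng, scale) for g, total in totals.items()}
```

A call such as `dp_group_by_sum(reports, ["car", "rail", "bike"], max_groups=2, max_value=50.0, epsilon=2.0)` returns one noisy sum per group, where the group names and bounds are placeholders. With this baseline the noise scale grows linearly in `max_groups * max_value`, which suggests why a mechanism tailored to the statistics of the data could be needed to meet strict utility requirements at $\varepsilon = 2$.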