This paper introduces Mayfly, a federated analytics approach that enables aggregate queries over ephemeral on-device data streams without central persistence of sensitive user data. Mayfly minimizes data through SQL-programmable on-device windowing and contribution bounding, anonymizes user data via streaming differential privacy (DP), and mandates immediate in-memory cross-device aggregation on the server, ensuring that only privatized aggregates are revealed to data analysts. Deployed for a sustainability use case that estimates transportation carbon emissions from private location data, Mayfly computed over 4 million statistics across more than 500 million devices with a per-device, per-week DP guarantee of $\varepsilon = 2$ while meeting strict data utility requirements. To achieve this, we designed a new DP mechanism for Group-By-Sum workloads that leverages statistical properties of location data and is potentially applicable to other domains.
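To make the pipeline described above concrete, the following is a minimal sketch of a differentially private Group-By-Sum with on-device contribution bounding and in-memory server aggregation. It is not the new mechanism the paper designs: it substitutes the textbook Laplace mechanism, and every name in it (`bound_contribution`, `private_group_by_sum`, `public_keys`, `max_keys`, `max_value`) is a hypothetical chosen for illustration. It assumes the set of group keys is fixed and public, and that each device submits one bounded report per aggregation window.

```python
import random
from collections import defaultdict


def bound_contribution(report, public_keys, max_keys, max_value):
    """Hypothetical on-device step: keep at most `max_keys` groups from the
    public key set and clip each value to [0, max_value]."""
    allowed = [(k, v) for k, v in sorted(report.items()) if k in public_keys]
    kept = allowed[:max_keys]  # deterministic choice, for illustration only
    return {k: min(max(float(v), 0.0), max_value) for k, v in kept}


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two i.i.d. exponentials
    with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_group_by_sum(reports, public_keys, max_keys, max_value, epsilon, rng):
    """Hypothetical server step: sum bounded reports in memory, then release
    only noised per-group totals (plain Laplace mechanism, an assumption)."""
    totals = defaultdict(float)
    for report in reports:
        bounded = bound_contribution(report, public_keys, max_keys, max_value)
        for key, value in bounded.items():
            totals[key] += value
    # Adding or removing one device changes at most `max_keys` sums,
    # each by at most `max_value`, so the L1 sensitivity is their product.
    scale = (max_keys * max_value) / epsilon
    # Noise is added for every public key, including keys with no reports,
    # so the set of released keys does not depend on the data.
    return {key: totals[key] + laplace_noise(scale, rng) for key in sorted(public_keys)}


if __name__ == "__main__":
    rng = random.Random(0)
    keys = {"bike", "car", "train"}
    device_reports = [
        {"car": 12.5, "train": 3.0},
        {"car": 40.0, "bike": 1.2},
        {"train": 7.5},
    ]
    noisy = private_group_by_sum(
        device_reports, keys, max_keys=2, max_value=20.0, epsilon=2.0, rng=rng
    )
    for key, value in noisy.items():
        print(f"{key}: {value:.2f}")
```

Under these assumptions a single run satisfies $\varepsilon$-DP for one report per device. A per-device, per-week guarantee would additionally require accounting for every report a device contributes during that week, which this sketch does not attempt.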