Pooling and sharing data increases and distributes its value. But since data cannot be revoked once shared, scenarios that require controlled release of data for regulatory, privacy, and legal reasons default to not sharing. Because selectively controlling what data to release is difficult, the few data-sharing consortia that exist are often built around data-sharing agreements resulting from long and tedious one-off negotiations. We introduce Data Station, a data escrow designed to enable the formation of data-sharing consortia. Data owners share data with the escrow knowing it will not be released without their consent. Data users delegate their computation to the escrow. The data escrow relies on delegated computation to execute queries without releasing the data first. Data Station leverages hardware enclaves to generate trust among participants, and exploits the centralization of data and computation to generate an audit log. We evaluate Data Station on machine learning and data-sharing applications while running on an untrusted intermediary. In addition to important qualitative advantages, we show that Data Station: i) outperforms federated learning baselines in accuracy and runtime for the machine learning application; ii) is orders of magnitude faster than alternative secure data-sharing frameworks; and iii) introduces small overhead on the critical path.
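The escrow workflow described above (owners deposit data, users delegate a computation, and only approved results leave the escrow while every step is logged) can be illustrated with a minimal in-memory sketch. This is not Data Station's actual interface: `ToyEscrow`, its methods, the party names, and the choice to identify a function by its Python name are all hypothetical, and the sketch omits the hardware enclave and any encryption.

```python
from dataclasses import dataclass, field


@dataclass
class ToyEscrow:
    """Hypothetical in-memory stand-in for a data escrow (no enclave, no encryption)."""
    _data: dict = field(default_factory=dict)      # dataset name -> (owner, rows)
    _approved: set = field(default_factory=set)    # (dataset, user, function name)
    audit_log: list = field(default_factory=list)  # append-only record of events

    def register(self, owner, name, rows):
        """An owner deposits a dataset; the raw rows never leave the escrow."""
        self._data[name] = (owner, list(rows))
        self.audit_log.append(("register", owner, name))

    def approve(self, owner, name, user, fn_name):
        """An owner consents to `user` running `fn_name` over dataset `name`."""
        if self._data[name][0] != owner:
            raise PermissionError("only the dataset owner can approve")
        self._approved.add((name, user, fn_name))
        self.audit_log.append(("approve", owner, name, user, fn_name))

    def run(self, user, fn, names):
        """Delegated computation: run `fn` inside the escrow, release only its result."""
        # Assumption: a function is identified by its name; a real system
        # would need a stronger identity, such as a hash of the code.
        denied = [n for n in names if (n, user, fn.__name__) not in self._approved]
        if denied:
            self.audit_log.append(("deny", user, fn.__name__, tuple(denied)))
            raise PermissionError(f"no consent for datasets: {denied}")
        result = fn(*[self._data[n][1] for n in names])
        self.audit_log.append(("release", user, fn.__name__, tuple(names)))
        return result


def pooled_mean(*datasets):
    """Example computation over the pooled rows of all datasets."""
    values = [v for rows in datasets for v in rows]
    return sum(values) / len(values)


escrow = ToyEscrow()
escrow.register("hospital_a", "a_readings", [1.0, 2.0, 3.0])
escrow.register("hospital_b", "b_readings", [5.0])
escrow.approve("hospital_a", "a_readings", "analyst", "pooled_mean")

# Only one owner has consented so far, so the escrow refuses to run.
try:
    escrow.run("analyst", pooled_mean, ["a_readings", "b_readings"])
    refused = False
except PermissionError:
    refused = True

# Once the second owner consents, the result (not the data) is released.
escrow.approve("hospital_b", "b_readings", "analyst", "pooled_mean")
result = escrow.run("analyst", pooled_mean, ["a_readings", "b_readings"])
```

Because data and computation are both centralized in the toy escrow, the audit log falls out of the design for free: each registration, approval, denial, and release passes through a single object that can record it.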