The last few years have witnessed a spate of data protection regulations alongside an ever-growing appetite for data usage in large businesses, a combination that makes maintaining compliance a significant challenge. To address this conflict, we present Data Guard, a fine-grained, purpose-based access control system for large data warehouses. Data Guard lets policy authors write policies in terms of semantic descriptions of data and the purpose of data access, and then translates these policies into SQL views that mask data from the underlying warehouse tables. At access time, Data Guard ensures compliance by transparently routing each table access to the appropriate data-masking view based on the purpose of the access, which minimizes the effort of adopting Data Guard in existing applications. Our enforcement solution masks data at much finer granularities than traditional solutions allow: in addition to row- and column-level masking, Data Guard can mask data at the sub-cell level for columns with non-atomic data types such as structs, arrays, and maps. This fine-grained masking allows Data Guard to preserve data utility for consumers while ensuring compliance. We implement a number of performance optimizations to minimize the overhead of data masking operations, and we perform numerous experiments to identify the key factors that influence this overhead and to demonstrate the efficiency of our implementation. Data Guard is deployed in LinkedIn's production data warehouses, where it ensures compliance of more than 20,000 table accesses each day across different data processing engines.
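The two mechanisms described in the abstract, translating a policy into a data-masking SQL view and routing a table access to the view that matches its purpose, can be illustrated with a minimal sketch. Everything named here (`MaskRule`, `view_name`, `build_view_sql`, `route`, the `table__purpose` naming scheme, and the example table and columns) is hypothetical and chosen only for illustration; it is not Data Guard's actual API or view layout, and it covers only column masking and row filtering, not sub-cell masking.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class MaskRule:
    """Hypothetical policy rule: replace one column with a SQL expression."""
    column: str        # column to mask
    replacement: str   # SQL expression substituted for the column


def view_name(table: str, purpose: str) -> str:
    # Assumed naming convention: one view per (table, purpose) pair.
    return f"{table}__{purpose}"


def build_view_sql(table: str, columns: Iterable[str], purpose: str,
                   rules: Iterable[MaskRule],
                   row_filter: Optional[str] = None) -> str:
    """Render a CREATE VIEW statement that masks columns and filters rows."""
    masked = {rule.column: rule.replacement for rule in rules}
    select_items = [
        f"{masked[col]} AS {col}" if col in masked else col
        for col in columns
    ]
    sql = (f"CREATE VIEW {view_name(table, purpose)} AS "
           f"SELECT {', '.join(select_items)} FROM {table}")
    if row_filter:
        sql += f" WHERE {row_filter}"
    return sql


def route(table: str, purpose: str, registered_views: Set[str]) -> str:
    """Resolve a table access to its purpose-specific view, or refuse it."""
    name = view_name(table, purpose)
    if name not in registered_views:
        raise PermissionError(f"no view for {table!r} with purpose {purpose!r}")
    return name


if __name__ == "__main__":
    sql = build_view_sql(
        table="members",
        columns=["id", "email", "country"],
        purpose="ads",
        rules=[MaskRule(column="email", replacement="NULL")],
        row_filter="consent = TRUE",
    )
    print(sql)
    print(route("members", "ads", {"members__ads"}))
```

In this sketch the caller keeps referring to the table by its original name and supplies only a purpose, which mirrors the transparent routing the abstract describes: the application code does not need to know which view it ends up reading.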