Differentially Private Compression and the Sensitivity of LZ77

We initiate the study of differentially private data-compression schemes, motivated by the insecurity of the popular "Compress-Then-Encrypt" framework. Data compression exploits redundancy in data to reduce the storage and bandwidth needed when files are stored or transmitted. However, if the contents of a file are confidential, then the length of the compressed file can leak confidential information about those contents. Encrypting the compressed file does not eliminate this leakage, because encryption schemes are designed to hide only the content of a confidential message, not its length. In our proposed Differentially Private Compress-Then-Encrypt framework, we add a random positive amount of padding to the compressed file to ensure that any leakage satisfies the rigorous privacy guarantee of $(\epsilon,\delta)$-differential privacy. The amount of padding required depends on the sensitivity of the compression scheme to small changes in the input, i.e., the degree to which changing a single character of the input message can affect the length of the compressed file. While some popular compression schemes are highly sensitive to small changes in the input, we argue that effective data-compression schemes do not necessarily have high sensitivity. Our primary technical contribution is a fine-grained analysis of the sensitivity of the LZ77 compression scheme (IEEE Trans. Inf. Theory 1977), one of the most common compression schemes used in practice. We show that the global sensitivity of the LZ77 compression scheme is upper bounded by $\mathcal{O}(W^{2/3}\log n)$, where $W\leq n$ denotes the size of the sliding window. When $W=n$, we show a lower bound of $\Omega(n^{2/3}\log^{1/3}n)$ on the global sensitivity of the LZ77 compression scheme, which is tight up to a sublogarithmic factor.
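To make the notion of sensitivity concrete, the following is a minimal sketch, not the exact LZ77 variant or analysis from the paper. It assumes a simplified greedy parse in which each block is either a copy of the longest match inside the sliding window or a single literal character, and it uses the number of blocks as a proxy for compressed length. The function names `lz77_num_blocks` and `local_sensitivity` are hypothetical. The second function brute-forces the local sensitivity at one particular input, which is only a lower bound on the global sensitivity (the maximum over all inputs of length $n$).

```python
def lz77_num_blocks(s: str, window: int) -> int:
    """Count blocks in a simplified greedy LZ77 parse of `s`.

    Assumption: each block is either a copy of the longest match that
    starts within the last `window` characters, or one literal character.
    """
    n = len(s)
    i = 0
    blocks = 0
    while i < n:
        best = 0
        # Try every candidate match start inside the sliding window.
        for start in range(max(0, i - window), i):
            length = 0
            # A match may run past position i (overlapping copy).
            while i + length < n and s[start + length] == s[i + length]:
                length += 1
            best = max(best, length)
        # Emit one block: the longest match, or a literal if none exists.
        i += max(best, 1)
        blocks += 1
    return blocks


def local_sensitivity(s: str, window: int, alphabet: str) -> int:
    """Largest change in block count over all single-character substitutions."""
    base = lz77_num_blocks(s, window)
    worst = 0
    for j in range(len(s)):
        for c in alphabet:
            if c == s[j]:
                continue
            neighbor = s[:j] + c + s[j + 1:]
            diff = abs(lz77_num_blocks(neighbor, window) - base)
            worst = max(worst, diff)
    return worst


if __name__ == "__main__":
    text = "abracadabra" * 3
    print(lz77_num_blocks(text, window=len(text)))
    print(local_sensitivity(text, window=len(text), alphabet="abcdr"))
```

Under these assumptions, each block costs roughly $\log n$ bits to encode, so a change of $k$ blocks corresponds to a change of about $k\log n$ bits in the compressed length. The brute-force search is cubic or worse in the input length and is meant only for experimenting with short strings.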
