Modern statistical estimation is often performed in a distributed setting where each sample belongs to a single user who shares their data with a central server. Users are typically concerned with preserving the privacy of their samples, and also with minimizing the amount of data they must transmit to the server. We give improved private and communication-efficient algorithms for estimating several popular measures of the entropy of a distribution. All of our algorithms have constant communication cost and satisfy local differential privacy. For a joint distribution over many variables whose conditional independence structure is given by a tree, we describe algorithms for estimating Shannon entropy that require a number of samples linear in the number of variables, compared to the quadratic sample complexity of prior work. We also describe an algorithm for estimating Gini entropy whose sample complexity has no dependence on the support size of the distribution and which can be implemented using a single round of concurrent communication between the users and the server. In contrast, the previously best-known algorithm has high communication cost and requires the server to facilitate interaction between the users. Finally, we describe an algorithm for estimating collision entropy that generalizes the best-known algorithm to the private and communication-efficient setting.
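For readers unfamiliar with the three entropy measures named above, the following is a minimal sketch of how each is computed from a fully known distribution. It assumes the standard textbook definitions (Shannon entropy as -sum p_i log p_i, Gini entropy as 1 - sum p_i^2, collision entropy as -log sum p_i^2, all with natural logarithms), which the abstract itself does not state. It is not the private, communication-efficient estimators described here; the function names are illustrative only.

```python
import math


def collision_probability(p):
    """Probability that two independent draws from p are equal: sum of p_i^2."""
    return sum(q * q for q in p)


def shannon_entropy(p):
    """Shannon entropy in nats (assumed standard definition); zero-mass terms are skipped."""
    return -sum(q * math.log(q) for q in p if q > 0)


def gini_entropy(p):
    """Gini entropy (assumed standard definition): one minus the collision probability."""
    return 1.0 - collision_probability(p)


def collision_entropy(p):
    """Collision (Renyi order-2) entropy in nats: negative log of the collision probability."""
    return -math.log(collision_probability(p))


# Illustrative distribution: uniform over four outcomes.
uniform = [0.25, 0.25, 0.25, 0.25]
print(shannon_entropy(uniform), gini_entropy(uniform), collision_entropy(uniform))
```

Both Gini and collision entropy depend on the distribution only through the collision probability sum p_i^2, which is why estimators for the two are closely related.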