We provide a novel statistical perspective on a classical problem at the intersection of computer science and information theory: recovering the empirical frequency of a symbol in a large discrete dataset using only a compressed representation, or sketch, obtained via random hashing. Departing from traditional algorithmic approaches, recent work has proposed Bayesian nonparametric (BNP) methods that can provide more informative frequency estimates by leveraging modeling assumptions about the distribution of the sketched data. In this paper, we propose a smoothed-Bayesian method, inspired by existing BNP approaches but designed in a frequentist framework to overcome their computational limitations when dealing with large-scale data from realistic distributions, including those with power-law tail behavior. For sketches obtained with a single hash function, our approach is supported by rigorous frequentist properties, including unbiasedness and optimality under a squared error loss function within an intuitive class of linear estimators. For sketches with multiple hash functions, we introduce an approach based on multi-view learning to construct computationally efficient frequency estimators. We validate our method on synthetic and real data, comparing its performance to that of existing alternatives.
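To make the problem setting concrete, the following minimal sketch shows how a dataset can be compressed with a single random hash function and how a symbol's frequency can then be queried from the sketch alone. It illustrates only the classical algorithmic baseline (the bucket count, which never underestimates the true frequency), not the smoothed-Bayesian estimator proposed in the paper. The function names `bucket`, `build_sketch`, and `upper_bound_estimate`, the use of a salted BLAKE2b hash, and the toy data are all assumptions made for illustration.

```python
import hashlib
from collections import Counter


def bucket(symbol: str, width: int, seed: int = 0) -> int:
    """Map a symbol to one of `width` buckets using a seeded hash.

    The salted BLAKE2b hash is an illustrative stand-in for a random hash function.
    """
    digest = hashlib.blake2b(
        symbol.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little") % width


def build_sketch(data, width: int, seed: int = 0) -> list:
    """Compress the data into `width` bucket counts (a single-hash sketch)."""
    counts = [0] * width
    for symbol in data:
        counts[bucket(symbol, width, seed)] += 1
    return counts


def upper_bound_estimate(sketch, symbol: str, seed: int = 0) -> int:
    """Classical estimate: the count of the bucket the symbol hashes to.

    Collisions can only add to a bucket, so this never underestimates
    the symbol's true empirical frequency.
    """
    return sketch[bucket(symbol, len(sketch), seed)]


# Toy data with a few frequent symbols and many rare ones (hypothetical example).
data = ["a"] * 50 + ["b"] * 20 + ["c"] * 5 + [f"rare{i}" for i in range(25)]
sketch = build_sketch(data, width=8, seed=1)
true_counts = Counter(data)

for symbol in ["a", "b", "c", "rare0"]:
    estimate = upper_bound_estimate(sketch, symbol, seed=1)
    print(symbol, "true:", true_counts[symbol], "sketch estimate:", estimate)
```

The gap between the bucket count and the true frequency comes entirely from hash collisions. That gap is what model-based estimators, such as the BNP and smoothed-Bayesian methods discussed above, try to shrink by using assumptions about the data distribution.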