This paper introduces a Bayesian nonparametric approach to frequency recovery from lossy-compressed discrete data, leveraging all information contained in a sketch obtained through random hashing. By modeling the data points as random samples from an unknown discrete distribution endowed with a Poisson-Kingman prior, we derive the posterior distribution of a symbol's empirical frequency given the sketch. This leads to principled frequency estimates through mean functionals, e.g., the posterior mean, median and mode. We highlight applications of this general result to Dirichlet process and Pitman-Yor process priors. Notably, we prove that the former prior uniquely satisfies a sufficiency property that simplifies the posterior distribution, while the latter enables a convenient large-sample asymptotic approximation. Additionally, we extend our approach to the problem of cardinality recovery, that is, estimating the number of distinct symbols in the sketched dataset. Our approach to frequency recovery also adapts to a more general "traits" setting, where each data point has integer levels of association with multiple symbols, typically referred to as "traits". By employing a generalized Indian buffet process, we compute the posterior distribution of a trait's frequency under both Poisson and Bernoulli distributions for the trait association levels, yielding exact and approximate posterior frequency distributions, respectively.
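To make the notion of a sketch obtained through random hashing concrete, the following minimal Python sketch hashes each symbol into one of a fixed number of buckets and stores only the bucket counts. It illustrates the lossy compression step only, not the Bayesian posterior computation described above. The function names (`bucket_of`, `build_sketch`, `bucket_count`), the use of a salted BLAKE2b hash as a stand-in for a random hash function, and the toy data are all assumptions made for illustration.

```python
import hashlib
from collections import Counter


def bucket_of(symbol: str, num_buckets: int, seed: int = 0) -> int:
    """Map a symbol to a bucket index (hypothetical helper).

    A salted BLAKE2b digest stands in for a random hash function;
    the seed selects which hash function is used.
    """
    digest = hashlib.blake2b(
        symbol.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little") % num_buckets


def build_sketch(data, num_buckets: int, seed: int = 0):
    """Compress the data into a list of per-bucket counts."""
    counts = [0] * num_buckets
    for symbol in data:
        counts[bucket_of(symbol, num_buckets, seed)] += 1
    return counts


def bucket_count(sketch, symbol: str, seed: int = 0) -> int:
    """Return the count of the bucket that the symbol hashes into."""
    return sketch[bucket_of(symbol, len(sketch), seed)]


# Toy data (assumed for illustration only).
data = ["a", "b", "a", "c", "a", "d", "b", "e"]
sketch = build_sketch(data, num_buckets=4)
true_freq = Counter(data)

for symbol in sorted(true_freq):
    # Collisions can only add to a bucket, so the bucket count
    # never falls below the symbol's true empirical frequency.
    print(symbol, true_freq[symbol], bucket_count(sketch, symbol))
```

Because distinct symbols may collide in the same bucket, the bucket count is only an upper bound on a symbol's empirical frequency. The frequency recovery problem is to infer how much of that bucket count belongs to the queried symbol.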