Given a lossy-compressed representation, or sketch, of data with values in a set of symbols, the frequency recovery problem is to estimate the empirical frequency of a new data point. Recent studies have applied Bayesian nonparametrics (BNP) to develop learning-augmented versions of the popular count-min sketch (CMS) recovery algorithm. In this paper, we present a novel BNP approach to frequency recovery, which is not built from the CMS but still relies on a sketch obtained by random hashing. Assuming that the data are random samples from an unknown discrete distribution endowed with a Poisson-Kingman (PK) prior, we provide the posterior distribution of the empirical frequency of a symbol, given the sketch. Estimates are then obtained as mean functionals of this posterior. We apply our result to the Dirichlet process (DP) and Pitman-Yor process (PYP) priors, and in particular: i) we characterize the DP prior as the sole PK prior featuring a property of sufficiency with respect to the sketch, leading to a simple posterior distribution; ii) we identify a large sample regime under which the PYP prior leads to a simple approximation of the posterior distribution. We then extend our BNP approach to a "traits" formulation of the frequency recovery problem, not yet studied in the CMS literature, in which each data point can belong to more than one symbol (trait) and exhibits a nonnegative integer level of association with each trait. In particular, by modeling data as random samples from a generalized Indian buffet process, we provide the posterior distribution of the empirical frequency level of a trait, given the sketch. This result is then applied under Poisson and Bernoulli distributions for the levels of association, leading to a simple posterior distribution and a simple approximation of the posterior distribution, respectively.
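To make the notion of a sketch obtained by random hashing concrete, the following is a minimal illustration, not the authors' implementation. It assumes string-valued symbols and a single seeded hash function; the names `bucket_of`, `build_sketch`, and `bucket_count` are hypothetical, and the salted BLAKE2b hash is only a stand-in for whatever random hash family is actually used. The sketch is the vector of bucket counts; the count of the bucket that a new symbol hashes to is the only information about that symbol's empirical frequency that is retained. The posterior distributions described above are not implemented here.

```python
import hashlib
from typing import Iterable, List


def bucket_of(symbol: str, num_buckets: int, seed: int = 0) -> int:
    """Map a symbol to a bucket index with a seeded hash (illustrative choice)."""
    digest = hashlib.blake2b(
        symbol.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),  # the seed plays the role of the random hash draw
    ).digest()
    return int.from_bytes(digest, "little") % num_buckets


def build_sketch(data: Iterable[str], num_buckets: int, seed: int = 0) -> List[int]:
    """Compress the data into bucket counts; the symbols themselves are discarded."""
    counts = [0] * num_buckets
    for symbol in data:
        counts[bucket_of(symbol, num_buckets, seed)] += 1
    return counts


def bucket_count(sketch: List[int], symbol: str, seed: int = 0) -> int:
    """Count of the bucket a new symbol hashes to.

    Hash collisions can only add to a bucket, so this is an upper bound on
    the symbol's true empirical frequency in the sketched data.
    """
    return sketch[bucket_of(symbol, len(sketch), seed)]


if __name__ == "__main__":
    data = ["a"] * 5 + ["b"] * 3 + ["c"]
    sketch = build_sketch(data, num_buckets=4, seed=1)
    print(sketch, bucket_count(sketch, "a", seed=1))
```

Because the bucket count can only overestimate the true frequency, any posterior over the empirical frequency of the new symbol, given the sketch, is supported on the integers from 0 up to that count.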