Our aim is to estimate the largest community (i.e., the mode) in a population composed of multiple disjoint communities. The estimation is performed in a fixed-confidence setting via sequential sampling of individuals with replacement. We consider two sampling models: (i) an identityless model, in which only the community of each sampled individual is revealed, and (ii) an identity-based model, in which the learner can also discern whether each sampled individual has been sampled before. The former corresponds to the classical problem of identifying the mode of a discrete distribution, whereas the latter seeks to capture the utility of identity information in mode estimation. For each model, we establish information-theoretic lower bounds on the expected number of samples needed to meet the prescribed confidence level, and we propose sound algorithms whose sample complexity is provably asymptotically optimal. Our analysis shows that identity information can indeed be used to improve the efficiency of community mode estimation.