Statistical inference on histograms and frequency counts plays a central role in categorical data analysis. Moving beyond classical methods that directly analyze labeled frequencies, we introduce a framework that models the multiset of unlabeled histograms via a mixture distribution to better capture unseen domain elements in the large-alphabet regime. We study the nonparametric maximum likelihood estimator (NPMLE) under this framework and establish its optimal convergence rate under the Poisson setting. The NPMLE also immediately yields flexible and efficient plug-in estimators for functional estimation problems, and a localized variant further achieves the optimal sample complexity for a wide range of symmetric functionals. Extensive experiments on synthetic data, real-world datasets, and large language models highlight the practical benefits of the proposed method.
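To make the mixture-model idea concrete, the following is a minimal sketch, not the authors' algorithm. It assumes that each symbol's count is drawn from a Poisson distribution whose mean follows an unknown mixing distribution, that the alphabet size `k` is known so that zero counts can be included, and that the mixing distribution is approximated by weights on a fixed grid fitted with plain EM. The names `poisson_npmle_em`, `mixture_loglik`, and `plugin_functional` are hypothetical, and the localized variant mentioned above is not shown.

```python
import numpy as np
from math import lgamma


def _poisson_lik(counts, grid):
    """Poisson pmf matrix with shape (num_counts, num_grid); grid must be > 0."""
    counts = np.asarray(counts, dtype=float)
    grid = np.asarray(grid, dtype=float)
    log_fact = np.array([lgamma(c + 1.0) for c in counts])
    log_lik = counts[:, None] * np.log(grid)[None, :] - grid[None, :] - log_fact[:, None]
    return np.exp(log_lik)


def mixture_loglik(counts, grid, weights):
    """Log-likelihood of the counts under the grid-supported Poisson mixture."""
    return float(np.sum(np.log(_poisson_lik(counts, grid) @ weights)))


def poisson_npmle_em(counts, grid, n_iter=500):
    """Approximate the Poisson-mixture NPMLE by EM over weights on a fixed grid."""
    lik = _poisson_lik(counts, grid)
    w = np.full(lik.shape[1], 1.0 / lik.shape[1])  # uniform starting weights
    for _ in range(n_iter):
        mix = lik @ w                      # marginal likelihood of each count
        post = lik * w / mix[:, None]      # posterior weight of each grid point
        w = post.mean(axis=0)              # EM update of the mixing weights
    return w


def plugin_functional(weights, grid, g, k):
    """Plug-in estimate of sum_i g(theta_i) as k times the integral of g under the fit."""
    return k * float(np.sum(weights * g(np.asarray(grid, dtype=float))))


# Illustrative usage with made-up counts (zeros are symbols that were not observed).
counts = [0, 0, 1, 1, 2, 3, 5, 8, 0, 1]
n = sum(counts)                            # total sample size
grid = np.linspace(0.05, 10.0, 60)         # candidate Poisson means (assumed range)
weights = poisson_npmle_em(counts, grid, n_iter=200)

# Example symmetric functional: Shannon entropy with p = theta / n.
entropy_term = lambda theta: -(theta / n) * np.log(theta / n)
entropy_estimate = plugin_functional(weights, grid, entropy_term, k=len(counts))
```

The grid range and resolution are tuning choices in this sketch. Because the estimate depends only on the multiset of counts and not on which symbol produced which count, it applies to any symmetric functional by swapping `entropy_term` for another function of the Poisson mean.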