We study statistical/computational tradeoffs for the following density estimation problem: given $k$ distributions $v_1, \ldots, v_k$ over a discrete domain of size $n$, and sampling access to a distribution $p$, identify a $v_i$ that is "close" to $p$. Our main result is the first data structure that, given a sublinear (in $n$) number of samples from $p$, identifies such a $v_i$ in time sublinear in $k$. We also give an improved version of the algorithm of Acharya et al. (2018) that reports $v_i$ in time linear in $k$. The experimental evaluation of the latter algorithm shows that it needs significantly fewer operations than prior work to reach a given accuracy.
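To make the problem statement concrete, the following is a minimal sketch of a naive baseline. It is not the data structure introduced here, nor the algorithm of Acharya et al. (2018). It assumes that "close" is measured in $\ell_1$ distance, which the abstract does not specify, and that each $v_i$ is given as an explicit probability vector of length $n$. The function names are hypothetical. The baseline builds the empirical distribution of the samples and scans all $k$ candidates, so it takes time proportional to $k \cdot n$ and, in general, needs far more samples than a sublinear method would.

```python
from collections import Counter


def empirical_distribution(samples, n):
    """Return the empirical distribution of `samples` over the domain {0, ..., n-1}."""
    counts = Counter(samples)
    m = len(samples)
    return [counts.get(x, 0) / m for x in range(n)]


def l1_distance(a, b):
    """Sum of absolute coordinate differences between two probability vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def closest_candidate(candidates, samples):
    """Naive baseline (assumption: closeness means l1 distance).

    Scans every candidate distribution and returns the index of the one
    nearest to the empirical distribution of the samples.
    """
    n = len(candidates[0])
    p_hat = empirical_distribution(samples, n)
    return min(
        range(len(candidates)),
        key=lambda i: l1_distance(candidates[i], p_hat),
    )


# Illustrative data (made up for this sketch): k = 3 candidates over a domain of size n = 4.
candidates = [
    [0.7, 0.1, 0.1, 0.1],
    [0.25, 0.25, 0.25, 0.25],
    [0.1, 0.1, 0.1, 0.7],
]
# Ten samples whose empirical distribution matches the third candidate.
samples = [3] * 7 + [0, 1, 2]
best = closest_candidate(candidates, samples)
```

The full scan over all candidates is the cost that a query time sublinear in $k$ avoids, and building the full empirical distribution is what a sample budget sublinear in $n$ rules out.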