Sample-Optimal Locally Private Hypothesis Selection and the Provable Benefits of Interactivity

We study hypothesis selection under the constraint of local differential privacy. Given a class $\mathcal{F}$ of $k$ distributions and a set of i.i.d. samples from an unknown distribution $h$, the goal of hypothesis selection is to select a distribution $\hat{f}$ whose total variation distance to $h$ is, with high probability, comparable to that of the best distribution in $\mathcal{F}$. We devise an $\varepsilon$-locally-differentially-private ($\varepsilon$-LDP) algorithm that uses $\Theta\left(\frac{k}{\alpha^2\min \{\varepsilon^2,1\}}\right)$ samples to guarantee that $d_{TV}(h,\hat{f})\leq \alpha + 9 \min_{f\in \mathcal{F}}d_{TV}(h,f)$ with high probability. This sample complexity is optimal for $\varepsilon<1$, matching the lower bound of Gopi et al. (2020). All previously known algorithms for this problem required $\Omega\left(\frac{k\log k}{\alpha^2\min \{ \varepsilon^2 ,1\}} \right)$ samples. Moreover, our result demonstrates the power of interaction for $\varepsilon$-LDP hypothesis selection: it beats the known lower bound of $\Omega\left(\frac{k\log k}{\alpha^2\min \{ \varepsilon^2 ,1\}} \right)$ on the sample complexity of non-interactive hypothesis selection, and it does so using only $\Theta(\log \log k)$ rounds of interaction. To prove our results, we define the notion of \emph{critical queries} for a Statistical Query Algorithm (SQA), which may be of independent interest. Informally, an SQA is said to use a small number of critical queries if its success relies on the accuracy of only a small number of the queries it asks. We then design an LDP algorithm that uses fewer critical queries.
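To make the setting concrete, the following is a minimal sketch of a single statistical query answered under $\varepsilon$-LDP. It is not the algorithm of this paper. It assumes discrete distributions stored as dictionaries, uses standard binary randomized response as the local randomizer, and compares two hypotheses with a Scheffé-style test. All function names (`randomized_response`, `ldp_estimate`, `scheffe_winner`) are hypothetical and chosen for illustration only.

```python
import math
import random


def randomized_response(bit, eps, rng):
    """Report the true bit with probability e^eps / (1 + e^eps), else flip it.

    This standard mechanism satisfies eps-LDP for a single binary value.
    """
    p_truth = math.exp(eps) / (1.0 + math.exp(eps))
    return bit if rng.random() < p_truth else 1 - bit


def ldp_estimate(bits, eps, rng):
    """Debiased estimate of the mean of `bits` from privatized reports.

    Each user contributes one bit and sends only its randomized version.
    """
    p = math.exp(eps) / (1.0 + math.exp(eps))
    reports = [randomized_response(b, eps, rng) for b in bits]
    mean = sum(reports) / len(reports)
    # E[report] = (1 - p) + (2p - 1) * bit, so invert that affine map.
    return (mean - (1.0 - p)) / (2.0 * p - 1.0)


def scheffe_winner(f, g, samples, eps, rng):
    """Pick whichever of f, g better matches the samples on the Scheffe set.

    f and g are dicts mapping outcomes to probabilities (an assumption of
    this sketch). The Scheffe set is A = {x : f(x) > g(x)}; the statistical
    query is the indicator of A, answered through randomized response.
    """
    scheffe_set = {x for x in f if f[x] > g.get(x, 0.0)}
    f_mass = sum(f[x] for x in scheffe_set)
    g_mass = sum(g.get(x, 0.0) for x in scheffe_set)
    bits = [1 if x in scheffe_set else 0 for x in samples]
    h_mass = ldp_estimate(bits, eps, rng)
    return "f" if abs(f_mass - h_mass) <= abs(g_mass - h_mass) else "g"


if __name__ == "__main__":
    rng = random.Random(0)
    f = {0: 0.8, 1: 0.2}
    g = {0: 0.3, 1: 0.7}
    # Draw samples from f, so f should win the comparison.
    samples = [0 if rng.random() < 0.8 else 1 for _ in range(20000)]
    print(scheffe_winner(f, g, samples, eps=1.0, rng=rng))
```

In this sketch every user answers exactly one query, so the number of samples consumed grows with the number of queries whose answers must be accurate. That is the quantity the notion of critical queries is meant to control.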
