This paper develops conformal inference methods to construct a confidence interval for the frequency of a queried object in a very large discrete data set, based on a sketch with a smaller memory footprint than the data. This approach requires no knowledge of the data distribution and can be combined with any sketching algorithm, including but not limited to the renowned count-min sketch, the count-sketch, and variations thereof. After explaining how to achieve marginal coverage for exchangeable random queries, we extend our solution to provide stronger inferences that account for the discreteness of the data and for heterogeneous query frequencies, while also increasing robustness to possible distribution shifts. These results are facilitated by a novel conformal calibration technique that guarantees valid coverage for a large fraction of distinct random queries. Finally, we show that our methods outperform existing frequentist and Bayesian alternatives empirically, both in simulations and in examples of text and SARS-CoV-2 DNA data.
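To make the idea of marginal coverage for exchangeable queries concrete, the following is a minimal sketch of a split-conformal interval built on a count-min sketch. It is an illustration under stated assumptions, not the procedure developed in this paper: the class `CountMinSketch`, the function `conformal_interval`, the synthetic heavy-tailed stream, and the choice of score (sketch estimate minus true count) are all hypothetical, and the true counts of the calibration objects are assumed to be known exactly. It does not implement the extensions for data discreteness, heterogeneous query frequencies, or distribution shift.

```python
import math
import random
from collections import Counter


class CountMinSketch:
    """Minimal count-min sketch over integer items (illustrative only)."""

    def __init__(self, depth, width, seed=0):
        rng = random.Random(seed)
        self.width = width
        # One random salt per row stands in for an independent hash function.
        self.salts = [rng.getrandbits(64) for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _cols(self, item):
        return [hash((salt, item)) % self.width for salt in self.salts]

    def add(self, item):
        for row, col in zip(self.table, self._cols(item)):
            row[col] += 1

    def estimate(self, item):
        # Collisions only add counts, so this never under-estimates.
        return min(row[col] for row, col in zip(self.table, self._cols(item)))


def conformal_interval(sketch, calib_items, true_counts, query, alpha=0.1):
    """Split-conformal interval [lower, upper] for the frequency of `query`.

    Assumes the calibration items and the query are exchangeable, and that
    true_counts holds the exact frequency of every calibration item.
    """
    # Nonconformity score: how much the sketch over-counts each item.
    scores = sorted(sketch.estimate(x) - true_counts[x] for x in calib_items)
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # finite-sample conformal rank
    q = scores[k - 1] if k <= n else float("inf")
    upper = sketch.estimate(query)  # deterministic upper bound
    lower = max(0, upper - q)       # conformal lower bound
    return lower, upper


rng = random.Random(1)
# Hypothetical heavy-tailed stream of integer object IDs.
stream = [int(rng.paretovariate(1.2)) for _ in range(20000)]
true_counts = Counter(stream)  # known here only because the data are synthetic

sketch = CountMinSketch(depth=4, width=256)
for x in stream:
    sketch.add(x)

# Calibration queries and the test query are drawn the same way from the stream.
calib = rng.sample(stream, 200)
query = rng.choice(stream)
lower, upper = conformal_interval(sketch, calib, true_counts, query, alpha=0.1)
```

The upper endpoint uses the fact that a count-min estimate never falls below the true frequency, so only the lower endpoint needs calibration. The rank `ceil((n + 1) * (1 - alpha))` is the standard finite-sample correction in split conformal prediction; the resulting guarantee is marginal over random queries, not conditional on any particular object.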