In the pooled data problem, we are given $n$ agents, each holding a hidden state bit, either $0$ or $1$. The hidden states are unknown and can be seen as the underlying ground truth $\sigma$. To uncover that ground truth, we are given a querying method that queries multiple agents at a time. Each query reports the sum of the states of the queried agents. Our goal is to learn the hidden state bits using as few queries as possible. So far, most of the literature deals with exact reconstruction of all hidden state bits. We study a relaxed variant in which we allow a small fraction of agents to be classified incorrectly. This becomes particularly relevant in the noisy variant of the pooled data problem, where the query results are subject to random noise. In this setting, we provide a doubly regular test design that assigns agents to queries. For this design, we analyze an approximate reconstruction algorithm that estimates the hidden bits in a greedy fashion. We give a rigorous analysis of the algorithm's performance, its error probability, and its approximation quality. As a main technical novelty, our analysis is uniform in the degree of noise and the sparsity of $\sigma$. Finally, simulations back up our theoretical findings and provide strong empirical evidence that our algorithm works well for realistic sample sizes.
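To make the setting concrete, the following is a minimal simulation sketch of the noisy pooled data problem. It is not the design or algorithm analyzed in the paper. Everything here is an assumption made for illustration: the function names, the stub-shuffling construction of the design (which may place an agent in the same query more than once), the additive Gaussian noise model, the decoder that ranks agents by the sum of their query results, and the premise that the number $k$ of ones in $\sigma$ is known.

```python
import random

def doubly_regular_design(n, m, d, rng):
    """Hypothetical design: every agent fills d query slots and every one of
    the m queries holds n*d/m slots (an agent may repeat within a query)."""
    if (n * d) % m != 0:
        raise ValueError("n * d must be divisible by m")
    stubs = [agent for agent in range(n) for _ in range(d)]
    rng.shuffle(stubs)
    size = n * d // m
    return [stubs[q * size:(q + 1) * size] for q in range(m)]

def run_queries(design, sigma, noise_sd, rng):
    """Each query reports the sum of the pooled bits plus Gaussian noise
    (the noise model is an assumption)."""
    return [sum(sigma[a] for a in pool) + rng.gauss(0.0, noise_sd)
            for pool in design]

def greedy_decode(design, results, n, k):
    """Score each agent by the summed results of its queries and label the
    k highest-scoring agents as ones (assumes k is known)."""
    scores = [0.0] * n
    for pool, result in zip(design, results):
        for a in pool:
            scores[a] += result
    top = sorted(range(n), key=lambda a: scores[a], reverse=True)[:k]
    estimate = [0] * n
    for a in top:
        estimate[a] = 1
    return estimate

def simulate(n=200, k=20, m=100, d=50, noise_sd=1.0, seed=0):
    """Return the fraction of misclassified agents in one simulated run."""
    rng = random.Random(seed)
    sigma = [1] * k + [0] * (n - k)
    rng.shuffle(sigma)
    design = doubly_regular_design(n, m, d, rng)
    results = run_queries(design, sigma, noise_sd, rng)
    estimate = greedy_decode(design, results, n, k)
    return sum(s != e for s, e in zip(sigma, estimate)) / n
```

Calling `simulate()` with different values of `m` and `noise_sd` gives a rough feel for how the fraction of misclassified agents in this toy model changes with the number of queries and the noise level.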