One of the most fundamental tasks in data science is to assist a user with unknown preferences in finding high-utility tuples within a large database. To elicit these preferences accurately, a widely adopted approach is to ask the user to compare pairs of tuples. In this paper, we study the problem of identifying one or more high-utility tuples by adaptively asking the user for as few pairwise comparisons as possible. We devise a single-pass streaming algorithm that processes each tuple in the stream at most once, while ensuring that both the memory size and the number of requested comparisons are, in the worst case, logarithmic in $n$, where $n$ is the total number of tuples. An important variant of the problem, which can help reduce human error in comparisons, allows users to declare a tie when confronted with a pair of tuples of nearly equal utility. We show that the theoretical guarantees of our method carry over to this variant. In addition, we show how to enhance existing pruning techniques from the literature by leveraging powerful tools from mathematical programming. Finally, we systematically evaluate all proposed algorithms on both synthetic and real-life datasets, examine their scalability, and demonstrate their superior performance over existing methods.