Towards Tractability of the Diversity of Query Answers: Ultrametrics to the Rescue

The set of answers to a query may be very large, and presenting the entire set can overwhelm users. In such cases, it may be preferable to present only a small subset of the answers. A natural requirement for this subset is that it be as diverse as possible, so that it reflects the variety of the entire population. To this end, the diversity of a subset is measured using a metric that determines how different two solutions are, together with a diversity function that extends this metric from pairs to sets. Several studies have shown that finding a diverse subset of an explicitly given set is intractable even for simple metrics (such as the Hamming distance) and simple diversity functions (such as the sum of all pairwise distances). This complexity barrier becomes even harder to overcome when the goal is to output a diverse subset of a set that is given only implicitly, such as the answers to a query over a database. Until now, tractable cases have been found only for restricted problems and particular diversity functions. To overcome these limitations, we focus on ultrametrics, which have been widely studied and used in many applications. Starting from any ultrametric $d$ and a diversity function $\delta$ extending $d$, we provide sufficient conditions on $\delta$ under which diverse answers can be constructed in polynomial time. To the best of our knowledge, these conditions are satisfied by all diversity functions considered in the literature. We complement these results with lower bounds that identify specific cases in which the conditions fail and finding diverse subsets becomes intractable. We conclude by applying these results to the evaluation of conjunctive queries, obtaining efficient algorithms for finding a diverse subset of solutions for acyclic conjunctive queries when the attribute order is used to measure diversity.
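To make the terminology concrete, the following is a minimal Python sketch and not the algorithm of the paper. It assumes the standard definition of an ultrametric: a metric $d$ that also satisfies the strong triangle inequality $d(a,c) \le \max(d(a,b), d(b,c))$. The distance `prefix_distance` is a hypothetical illustration of a distance induced by an attribute order (tuples that first differ at an earlier attribute are farther apart); the abstract does not spell out the exact definition used in the paper. The names `sum_diversity`, `min_diversity`, and `most_diverse_bruteforce` are likewise illustrative, and the brute-force search is exponential in general. It only shows what "a most diverse subset of size $k$" means.

```python
from itertools import combinations


def prefix_distance(a, b):
    """Hypothetical attribute-order distance: the number of attributes
    from the first position where the tuples differ (0 if they are equal)."""
    assert len(a) == len(b)
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return len(a) - i
    return 0


def hamming_distance(a, b):
    """Number of positions where the tuples differ (a metric, not an ultrametric)."""
    return sum(x != y for x, y in zip(a, b))


def is_ultrametric(points, d):
    """Brute-force check of the ultrametric axioms on a finite set of points."""
    for a in points:
        for b in points:
            if (d(a, b) == 0) != (a == b) or d(a, b) != d(b, a):
                return False
            for c in points:
                # Strong triangle inequality.
                if d(a, c) > max(d(a, b), d(b, c)):
                    return False
    return True


def sum_diversity(subset, d):
    """Diversity as the sum of all pairwise distances."""
    return sum(d(a, b) for a, b in combinations(subset, 2))


def min_diversity(subset, d):
    """Diversity as the minimum pairwise distance."""
    return min(d(a, b) for a, b in combinations(subset, 2))


def most_diverse_bruteforce(answers, k, d, delta):
    """Try every k-subset (illustration only; exponential in general)."""
    return max(combinations(answers, k), key=lambda s: delta(s, d))


# Toy answer set; attributes are ordered from most to least significant.
answers = [("db", "sql", 1), ("db", "sql", 2), ("db", "nosql", 1), ("ml", "cnn", 1)]
best = most_diverse_bruteforce(answers, 3, prefix_distance, sum_diversity)
```

On this toy input, `prefix_distance` passes the `is_ultrametric` check, whereas `hamming_distance` fails it on the points `(0, 0)`, `(0, 1)`, `(1, 1)`: the two outer points are at distance 2, but each is at distance 1 from the middle one.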
