A Theoretical Framework for Distribution-Aware Dataset Search

Effective data discovery is a cornerstone of modern data-driven decision-making. Yet, identifying datasets with specific distributional characteristics, such as percentiles or preferences, remains challenging. While recent proposals have enabled users to search based on percentile predicates, much of the research in data discovery relies on heuristics. This paper presents the first theoretically backed framework that unifies data discovery under centralized and decentralized settings. Let $\mathcal{P}=\{P_1,...,P_N\}$ be a repository of $N$ datasets, where $P_i\subset \mathbb{R}^d$, for $d=O(1)$ . We study the percentile indexing (Ptile) problem and the preference indexing (Pref) problem under the centralized and the federated setting. In the centralized setting we assume direct access to the datasets. In the federated setting we assume access to a synopsis of each dataset. The goal of Ptile is to construct a data structure such that given a predicate (rectangle $R$ and interval $\theta$) report all indexes $J$ such that $j\in J$ iff $|P_j\cap R|/|P_j|\in\theta$. The goal of Pref is to construct a data structure such that given a predicate (vector $v$ and interval $\theta$) report all indexes $J$ such that $j\in J$ iff $\omega(P_j,v)\in \theta$, where $\omega(P_j,v)$ is the inner-product of the $k$-th largest projection of $P_j$ on $v$. We first show that we cannot hope for near-linear data structures with polylogarithmic query time in the centralized setting. Next we show $\tilde{O}(N)$ space data structures that answer Ptile and Pref queries in $\tilde{O}(1+OUT)$ time, where $OUT$ is the output size. Each data structure returns a set of indexes $J$ such that i) for every $P_i$ that satisfies the predicate, $i\in J$ and ii) if $j\in J$ then $P_j$ satisfies the predicate up to an additive error $\varepsilon+2\delta$, where $\varepsilon\in(0,1)$ and $\delta$ is the error of synopses.

翻译：有效的数据发现是现代数据驱动决策的基石。然而，识别具有特定分布特征（如百分位数或偏好）的数据集仍然具有挑战性。尽管近期的研究提议允许用户基于百分位数谓词进行搜索，但数据发现领域的大部分研究仍依赖于启发式方法。本文提出了首个具有理论支撑的框架，该框架统一了集中式与去中心化设置下的数据发现问题。令 $\mathcal{P}=\{P_1,...,P_N\}$ 为一个包含 $N$ 个数据集的存储库，其中 $P_i\subset \mathbb{R}^d$，且 $d=O(1)$。我们研究了集中式与联邦式设置下的百分位数索引（Ptile）问题和偏好索引（Pref）问题。在集中式设置中，我们假设能直接访问数据集。在联邦式设置中，我们假设能访问每个数据集的概要。Ptile 问题的目标是构建一个数据结构，使得给定一个谓词（矩形 $R$ 和区间 $\theta$），能够报告所有索引 $J$，满足 $j\in J$ 当且仅当 $|P_j\cap R|/|P_j|\in\theta$。Pref 问题的目标是构建一个数据结构，使得给定一个谓词（向量 $v$ 和区间 $\theta$），能够报告所有索引 $J$，满足 $j\in J$ 当且仅当 $\omega(P_j,v)\in \theta$，其中 $\omega(P_j,v)$ 是 $P_j$ 在 $v$ 上投影的第 $k$ 大值的内积。我们首先证明，在集中式设置中，我们无法期望获得具有对数多项式查询时间的近线性数据结构。接着，我们展示了空间复杂度为 $\tilde{O}(N)$ 的数据结构，其能以 $\tilde{O}(1+OUT)$ 的时间复杂度回答 Ptile 和 Pref 查询，其中 $OUT$ 是输出大小。每个数据结构返回一个索引集合 $J$，满足：i) 对于每个满足谓词的数据集 $P_i$，有 $i\in J$；ii) 如果 $j\in J$，则 $P_j$ 在附加误差 $\varepsilon+2\delta$ 范围内满足谓词，其中 $\varepsilon\in(0,1)$，$\delta$ 是概要的误差。