Tabular data embeddings have become a cornerstone of data profiling and data integration pipelines, enabling tasks such as entity annotation and resolution, schema matching, column type detection, and table search, among others. Existing approaches embed rows, columns, or entire tables into a vector space and rely on nearest-neighbor search to retrieve candidate matches. A fundamental limitation of current embedding methods is the lack of interpretable similarity scores: the concrete similarity value between a query and its nearest neighbor carries no intrinsic meaning, making it impossible to determine whether that neighbor is a true match or simply the least-dissimilar item in a corpus that contains no valid answer. This inability to set principled retrieval thresholds undermines practical deployment, particularly for zero-match detection. We investigate Hyperdimensional Computing (HDC), specifically the Holographic Reduced Representations (HRR) model, as a framework for tabular row embeddings when the retrieval task corresponds to answering structured select-project queries in vector space. Exploiting the algebraic properties of HDC operations, we derive closed-form expected similarity values for both equality and non-equality retrieval predicates. These values converge to interpretable values as dimensionality increases, and we use them to identify suitable retrieval thresholds. We evaluate HDC against EmbDI, a graph-based baseline, on two real-world datasets across varying table sizes and predicate lengths. Our results show that HDC matches or outperforms EmbDI for row retrieval across all configurations, handles non-equality predicates more robustly, and achieves perfect attribute projection accuracy at sufficient dimensionality, while uniquely enabling reliable identification of zero-match predicates through its principled thresholds.
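To make the select-project idea concrete, the following is a minimal sketch of a generic HRR row encoding, not the encoding, similarity formulas, or thresholds derived in the paper. It assumes a common textbook construction: each attribute name and each value gets a random vector with i.i.d. N(0, 1/d) entries, an attribute-value pair is bound by circular convolution, and a row is the sum of its bound pairs. Under that assumption, the dot product between a row and a bound query pair is close to 1 when the row contains the pair and close to 0 otherwise, so a fixed threshold can separate matches from non-matches and report an empty result. The toy table, the names `hv`, `encode`, `select`, and `project`, the dimensionality, and the midpoint threshold are all illustrative choices, and only equality predicates are covered.

```python
import numpy as np

DIM = 4096  # assumed dimensionality; larger values tighten the scores around 0 and 1
rng = np.random.default_rng(0)
codebook = {}  # (kind, name) -> random hypervector


def hv(kind, name):
    """Return the cached random hypervector for an attribute or value symbol."""
    key = (kind, name)
    if key not in codebook:
        # i.i.d. N(0, 1/DIM) entries, so each vector has expected squared norm 1
        codebook[key] = rng.normal(0.0, 1.0 / np.sqrt(DIM), DIM)
    return codebook[key]


def bind(x, y):
    """HRR binding: circular convolution, computed in the Fourier domain."""
    return np.fft.irfft(np.fft.rfft(x) * np.fft.rfft(y), n=DIM)


def unbind(x, bound):
    """Approximate inverse of bind: circular correlation with x."""
    return np.fft.irfft(np.conj(np.fft.rfft(x)) * np.fft.rfft(bound), n=DIM)


def encode(pairs):
    """Superpose the bound attribute-value pairs of a row or of a predicate."""
    return np.sum([bind(hv("attr", a), hv("val", v)) for a, v in pairs.items()], axis=0)


# Hypothetical toy table, used only for illustration.
TABLE = [
    {"city": "Paris", "country": "France", "language": "French"},
    {"city": "Lyon", "country": "France", "language": "French"},
    {"city": "Berlin", "country": "Germany", "language": "German"},
]
ROW_VECTORS = [encode(row) for row in TABLE]


def select(predicate):
    """Indices of rows that satisfy a conjunction of equality conditions."""
    k = len(predicate)
    query = encode(predicate)
    # The normalized score is roughly the fraction of satisfied conjuncts:
    # about 1 for a full match and at most about (k - 1) / k otherwise.
    threshold = 1.0 - 1.0 / (2 * k)  # assumed midpoint between those two cases
    return [i for i, r in enumerate(ROW_VECTORS) if np.dot(r, query) / k > threshold]


def project(row_index, attribute):
    """Recover an attribute's value by unbinding, then nearest-symbol cleanup."""
    noisy = unbind(hv("attr", attribute), ROW_VECTORS[row_index])
    candidates = [key for key in codebook if key[0] == "val"]
    return max(candidates, key=lambda key: np.dot(codebook[key], noisy))[1]


if __name__ == "__main__":
    print(select({"country": "France"}))                  # rows with country = France
    print(select({"country": "France", "city": "Lyon"}))  # two-condition predicate
    print(select({"country": "Spain"}))                   # predicate with no valid answer
    print(project(2, "language"))                         # projection on a retrieved row
```

Because the threshold depends only on the number of conditions and not on the corpus, a predicate with no matching row yields an empty list instead of the least-dissimilar row, which is the behavior a plain nearest-neighbor lookup cannot provide.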