Statistical Embeddings for Similarity, Retrieval, and Interpretable Alignment of Numeric Tabular Datasets

Numeric tabular datasets are the dominant data format in scientific practice, yet large language models lack native mechanisms for representing them meaningfully across heterogeneous feature spaces. Existing approaches either target predictive modeling over individual datasets, which requires a shared set of variable definitions, or lack mechanisms for interpretable cross-dataset alignment. The proposed methodology characterizes numeric tabular datasets through structured exploratory data analysis (EDA) descriptors, embeds those descriptors into a shared vector space using a pretrained sentence transformer, and quantifies cross-dataset similarity via Canonical Correlation Analysis (CCA). In addition, a penalized formulation of CCA recovers sparse, interpretable variable-level correspondences between datasets, identifying which statistical descriptors or variable-level quantities drive cross-dataset alignment without requiring shared variable names or feature conventions. Differential privacy can optionally be applied to the descriptor set before embedding, which supports deployment in sensitive data contexts because raw observations are not needed at comparison time. The methodology is evaluated on 15 datasets spanning general-purpose benchmarks, materials informatics, and nuclear-grade graphite characterization. Results show an overall precision-at-1 (P@1) score of 0.9, and both the retrieval of known nearest neighbors and the cluster structure remain robust across embedding ablations and differential privacy budgets. The proposed framework provides a principled pathway for integrating heterogeneous numeric data into retrieval-augmented generation pipelines while preserving statistical context, with direct applications to data-driven algorithm selection and simulation model initialization for unknown datasets.
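As a rough illustration of the descriptor and CCA steps summarized in the abstract, the following minimal sketch computes per-variable summary statistics for two datasets and then their canonical correlations. Everything specific here is an assumption made for illustration, not the authors' implementation: the choice of descriptors, the layout in which rows are descriptor types and columns are variables, the function names `eda_descriptors` and `canonical_correlations`, and the use of the mean canonical correlation as a similarity score. The sentence-transformer embedding, the penalized CCA, and the differential privacy step are omitted.

```python
import numpy as np


def eda_descriptors(data: np.ndarray) -> np.ndarray:
    """Summarize each column of `data` (rows = observations).

    Returns an (11, n_variables) matrix: one row per descriptor type.
    The descriptor set is a hypothetical choice for this sketch.
    """
    mean = data.mean(axis=0)
    std = data.std(axis=0, ddof=1)
    z = (data - mean) / std
    skew = (z ** 3).mean(axis=0)
    excess_kurtosis = (z ** 4).mean(axis=0) - 3.0
    quantiles = np.quantile(data, [0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0], axis=0)
    return np.vstack([mean, std, skew, excess_kurtosis, quantiles])


def canonical_correlations(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Canonical correlations between two matrices that share a row index.

    Uses the standard QR + SVD route: orthonormalize each centered block,
    then take the singular values of the cross-product of the two bases.
    Assumes both centered blocks have full column rank.
    """
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    qx, _ = np.linalg.qr(x)
    qy, _ = np.linalg.qr(y)
    singular_values = np.linalg.svd(qx.T @ qy, compute_uv=False)
    # Guard against tiny floating-point overshoot outside [0, 1].
    return np.clip(singular_values, 0.0, 1.0)


# Two synthetic datasets with different numbers of variables and no shared names.
rng = np.random.default_rng(0)
dataset_a = rng.normal(size=(200, 3))
dataset_b = rng.gamma(shape=2.0, size=(150, 4))

desc_a = eda_descriptors(dataset_a)  # shape (11, 3)
desc_b = eda_descriptors(dataset_b)  # shape (11, 4)

correlations = canonical_correlations(desc_a, desc_b)
similarity = float(correlations.mean())  # one possible scalar summary
print(correlations, similarity)
```

Because the two blocks share only the descriptor rows, no common variable names are needed. With this few rows relative to the number of variables, the canonical correlations are optimistic, so the scalar score should be read as illustrative only.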

