Improving Robustness of Tabular Retrieval via Representational Stability

Transformer-based table retrieval systems flatten structured tables into token sequences, making retrieval sensitive to the choice of serialization even when table semantics remain unchanged. We show that semantically equivalent serializations, such as $\texttt{csv}$, $\texttt{tsv}$, $\texttt{html}$, $\texttt{markdown}$, and $\texttt{ddl}$, can produce substantially different embeddings and retrieval results across multiple benchmarks and retriever families. To address this instability, we treat the serialization embeddings of a table as noisy views of a shared semantic signal and use their centroid as a canonical target representation. We show that centroid averaging suppresses format-specific variation and can recover the semantic content common to different serializations when format-induced shifts differ across tables. Empirically, centroid representations outrank individual formats in aggregate pairwise comparisons across $\texttt{MPNet}$, $\texttt{BGE-M3}$, $\texttt{ReasonIR}$, and $\texttt{SPLADE}$. We further introduce a lightweight residual bottleneck adapter on top of a frozen encoder that maps single-serialization embeddings towards centroid targets while preserving variance and enforcing covariance regularization. The adapter improves robustness for several dense retrievers, though gains are model-dependent and weaker for sparse lexical retrieval. These results identify serialization sensitivity as a major source of retrieval variance and show the promise of post hoc geometric correction for serialization-invariant table retrieval. Our code, datasets, and models are available at $\href{https://github.com/KBhandari11/Centroid-Aligned-Table-Retrieval}{https://github.com/KBhandari11/Centroid-Aligned-Table-Retrieval}$.
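
The centroid target and the adapter's forward pass described above can be illustrated with a minimal sketch. It is not taken from the released code: the function names (`l2_normalize`, `centroid_embedding`, `adapter_forward`), the choice to L2-normalize each view before and after averaging, the ReLU nonlinearity, and the synthetic "semantic signal plus format noise" data are all assumptions made for illustration. The training objective (alignment to centroid targets with variance and covariance terms) is not shown.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit L2 norm along `axis`."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def centroid_embedding(format_embeddings):
    """Average the per-format embeddings of ONE table.

    format_embeddings: array of shape (n_formats, dim), one row per
    serialization (e.g. csv, tsv, html, markdown, ddl).
    Returns a unit-norm vector of shape (dim,).
    Normalizing before and after the mean is an assumption of this sketch.
    """
    views = l2_normalize(np.asarray(format_embeddings, dtype=np.float64))
    return l2_normalize(views.mean(axis=0))


def adapter_forward(x, w_down, w_up):
    """Hypothetical residual bottleneck adapter: x + up(relu(down(x))).

    x: (batch, dim) embeddings from a frozen encoder.
    w_down: (dim, bottleneck), w_up: (bottleneck, dim).
    """
    hidden = np.maximum(x @ w_down, 0.0)  # ReLU in the bottleneck
    return x + hidden @ w_up              # residual connection


# Synthetic data: one shared semantic direction plus per-format noise.
rng = np.random.default_rng(0)
dim, n_formats, bottleneck = 64, 5, 8
semantic = l2_normalize(rng.normal(size=dim))
noise = 0.5 * rng.normal(size=(n_formats, dim)) / np.sqrt(dim)
views = l2_normalize(semantic + noise)

centroid = centroid_embedding(views)
view_sims = views @ semantic             # cosine of each format to the signal
centroid_sim = float(centroid @ semantic)

# With a zero-initialized up-projection the adapter starts as the identity,
# so the frozen encoder's embeddings are unchanged before any training.
w_down = rng.normal(size=(dim, bottleneck)) / np.sqrt(dim)
w_up = np.zeros((bottleneck, dim))
adapted = adapter_forward(views, w_down, w_up)

print("mean per-format cosine:", view_sims.mean())
print("centroid cosine:", centroid_sim)
```

In this toy setting the centroid's cosine similarity to the shared signal is at least the mean per-format similarity, because averaging unit vectors can only shrink the norm of the mean. Real encoders need not behave like additive noise around a shared signal, so the sketch only illustrates the intuition, not the reported results.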
