Transformer-based table retrieval systems flatten structured tables into token sequences, making retrieval sensitive to the choice of serialization even when table semantics remain unchanged. We show that semantically equivalent serializations, such as $\texttt{csv}$, $\texttt{tsv}$, $\texttt{html}$, $\texttt{markdown}$, and $\texttt{ddl}$, can produce substantially different embeddings and retrieval results across multiple benchmarks and retriever families. To address this instability, we treat the serialization embeddings of a table as noisy views of a shared semantic signal and use their centroid as a canonical target representation. We show that centroid averaging suppresses format-specific variation and can recover the semantic content common to different serializations when format-induced shifts differ across tables. Empirically, centroid representations outrank individual formats in aggregate pairwise comparisons across $\texttt{MPNet}$, $\texttt{BGE-M3}$, $\texttt{ReasonIR}$, and $\texttt{SPLADE}$. We further introduce a lightweight residual bottleneck adapter on top of a frozen encoder that maps single-serialization embeddings towards centroid targets while preserving variance and enforcing covariance regularization. The adapter improves robustness for several dense retrievers, though gains are model-dependent and weaker for sparse lexical retrieval. These results identify serialization sensitivity as a major source of retrieval variance and show the promise of post hoc geometric correction for serialization-invariant table retrieval.
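The centroid construction and the adapter's forward pass can be illustrated with a minimal NumPy sketch. It is a toy under stated assumptions, not the paper's implementation: the embeddings are synthetic (a shared unit vector plus Gaussian "format noise") rather than encoder outputs, and the helper names `l2_normalize`, `centroid_embedding`, and `residual_bottleneck`, the ReLU activation, the bottleneck width of 16, and the zero-initialised up-projection are all hypothetical choices. The variance and covariance regularizers used during training are not shown.

```python
import numpy as np


def l2_normalize(x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row of x (or x itself, if 1-D) to unit Euclidean length."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def centroid_embedding(views: np.ndarray) -> np.ndarray:
    """Average the per-format embeddings of one table into a single unit vector.

    views has shape (n_formats, dim): one row per serialization (csv, tsv, ...).
    """
    return l2_normalize(l2_normalize(views).mean(axis=0))


def residual_bottleneck(x: np.ndarray, w_down: np.ndarray, w_up: np.ndarray) -> np.ndarray:
    """Hypothetical adapter forward pass: x + relu(x @ w_down) @ w_up."""
    hidden = np.maximum(x @ w_down, 0.0)
    return x + hidden @ w_up


# Synthetic stand-in for encoder outputs: a shared signal plus format noise.
rng = np.random.default_rng(0)
dim, n_formats, bottleneck = 64, 5, 16
semantic = l2_normalize(rng.normal(size=dim))
views = l2_normalize(semantic + 0.15 * rng.normal(size=(n_formats, dim)))

centroid = centroid_embedding(views)
view_sims = views @ semantic              # cosine of each format view to the signal
centroid_sim = float(centroid @ semantic)  # cosine of the centroid to the signal

# Assumed initialisation: a zero up-projection makes the adapter start as the
# identity map, so training would only learn a correction towards the centroid.
w_down = rng.normal(scale=0.02, size=(dim, bottleneck))
w_up = np.zeros((bottleneck, dim))
adapted = residual_bottleneck(views, w_down, w_up)

print(f"mean view cosine: {view_sims.mean():.3f}, centroid cosine: {centroid_sim:.3f}")
```

In this toy setting the centroid is at least as close to the shared signal as the average single-format view, because the mean of unit vectors has norm at most one. Whether the same holds for a real encoder depends on how format-induced shifts vary across tables, which is the condition stated in the abstract.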