Format as a Prior: Quantifying and Analyzing Bias in LLMs for Heterogeneous Data

Large Language Models (LLMs) are increasingly employed in applications that require processing information in heterogeneous formats, including text, tables, infoboxes, and knowledge graphs. However, biases toward particular formats may undermine LLMs' ability to integrate heterogeneous data impartially, potentially resulting in reasoning errors and increased risks in downstream tasks. It remains unclear whether such biases are systematic, which data-level factors drive them, and what internal mechanisms underlie their emergence. In this paper, we present the first comprehensive study of format bias in LLMs through a three-stage empirical analysis. The first stage explores the presence and direction of bias across a diverse range of LLMs. The second stage examines how key data-level factors influence these biases. The third stage analyzes how format bias emerges within LLMs' attention patterns and evaluates the effectiveness of a lightweight intervention. Our results show that format bias is consistent across model families, is driven by information richness, structure quality, and representation type, and is closely associated with attention imbalance within the LLMs. Based on these investigations, we identify three future research directions for reducing format bias: enhancing data pre-processing through format repair and normalization, introducing inference-time interventions such as attention re-weighting, and developing format-balanced training corpora. These directions will support the design of more robust and fair heterogeneous data processing systems.
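To make "attention imbalance" and "attention re-weighting" concrete, the following is a minimal sketch and not the procedure used in this paper. It assumes a single row of attention weights is available as a plain list and that the token spans of each format are known. The names `attention_share` and `reweight`, and the equal-share default target, are hypothetical choices made for illustration only.

```python
from typing import Dict, List, Optional, Tuple

Span = Tuple[int, int]  # half-open [start, end) token indices (assumed known)


def attention_share(weights: List[float], spans: Dict[str, Span]) -> Dict[str, float]:
    """Fraction of the attention mass inside the given spans that falls on each format."""
    mass = {name: sum(weights[s:e]) for name, (s, e) in spans.items()}
    total = sum(mass.values())
    return {name: m / total for name, m in mass.items()}


def reweight(
    weights: List[float],
    spans: Dict[str, Span],
    target: Optional[Dict[str, float]] = None,
) -> List[float]:
    """Rescale weights so each span receives its target share of the covered mass.

    Assumes every span has positive attention mass. Tokens outside all spans
    are left unchanged, so the total attention mass is preserved.
    """
    share = attention_share(weights, spans)
    if target is None:
        # Hypothetical default: give every format an equal share.
        target = {name: 1.0 / len(spans) for name in spans}
    out = list(weights)
    for name, (s, e) in spans.items():
        scale = target[name] / share[name]
        for i in range(s, e):
            out[i] = weights[i] * scale
    return out


# Toy attention row: two text tokens, two table tokens, one other token.
row = [0.1, 0.1, 0.3, 0.3, 0.2]
spans = {"text": (0, 2), "table": (2, 4)}
before = attention_share(row, spans)
balanced = reweight(row, spans)
after = attention_share(balanced, spans)
```

In this toy row the table span holds three times the attention mass of the text span; after re-weighting, both spans hold equal mass and the row still sums to the same total. Whether such balancing actually reduces format bias in a real model is an empirical question that this sketch does not address.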
