The deployment of Large Language Models (LLMs) in long-context scenarios is hindered by computational inefficiency and significant information redundancy. Although recent work has widely adopted context compression to address these challenges, existing research focuses only on model-side improvements; the impact of the data distribution itself on context compression remains largely unexplored. To bridge this gap, we are the first to adopt a data-centric perspective and systematically investigate how data distribution affects compression quality along two dimensions: input data and intrinsic data (i.e., the model's internal pretrained knowledge). We evaluate the semantic integrity of compressed representations within an autoencoder-based framework. Our experiments reveal that: (1) encoder-measured input entropy correlates negatively with compression quality, whereas decoder-measured entropy shows no significant relationship under a frozen-decoder setting; and (2) a gap between the intrinsic data of the encoder and that of the decoder significantly diminishes compression gains and is hard to mitigate. Based on these findings, we present practical guidelines for optimizing compression gains.
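As a minimal sketch of the input-entropy statistic referenced above (the abstract does not specify the exact estimator, so the functions below, `token_entropy` and `mean_input_entropy`, are hypothetical names; the logits are assumed to come from the encoder's next-token predictions):

```python
import numpy as np

def token_entropy(logits):
    """Shannon entropy (in nats) of the next-token distribution at each position."""
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
    return -(probs * np.log(probs + 1e-12)).sum(axis=-1)

def mean_input_entropy(logits):
    """Average per-token entropy over the sequence: one scalar summarizing
    how 'surprising' the encoder finds the input."""
    return float(token_entropy(logits).mean())

# Toy example: 4 positions over a vocabulary of 8, with random logits
# standing in for real encoder outputs.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 8))
print(round(mean_input_entropy(logits), 3))
```

Under the paper's finding (1), inputs with higher values of this statistic under the encoder would be expected to compress less faithfully.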