Post-training model compression is essential for enhancing the portability of Large Language Models (LLMs) while preserving their performance. While several compression approaches have been proposed, less emphasis has been placed on selecting the most suitable set of data (the so-called \emph{calibration data}) for finding the compressed model configuration. The choice of calibration data is critical to preserving model capabilities both within and across tasks. In this work, we address the challenge of identifying high-performance calibration sets for both pruning and quantization by analyzing intrinsic data properties rather than model-specific signals. We introduce \texttt{\textbf{ZipCal}}, a model-agnostic data curation strategy that maximizes lexical diversity based on Zipfian power laws. Experiments demonstrate that our method consistently outperforms standard uniform random sampling across various pruning benchmarks. Notably, it also performs on par, in terms of downstream performance, with a state-of-the-art method that relies on model perplexity. The latter becomes prohibitively expensive for large models and datasets, while \texttt{\textbf{ZipCal}} is on average $\sim$240$\times$ faster owing to its tractable linear complexity.\footnote{We make the code and the experiments available at https://github.com/FrancescoMonaco/ZipCal.}
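To make the idea of selecting calibration data from intrinsic lexical properties more concrete, the following is a minimal illustrative sketch and not the ZipCal algorithm itself. It assumes whitespace tokenization and uses a simple greedy vocabulary-coverage criterion as a stand-in for lexical diversity; the function names `tokenize` and `select_calibration_set` are hypothetical. Unlike the linear-complexity procedure described above, this greedy loop rescans the remaining documents at every step.

```python
def tokenize(text):
    # Assumption: lowercased whitespace tokenization, chosen only for illustration.
    return text.lower().split()


def select_calibration_set(corpus, k):
    """Greedily pick up to k documents that add the most unseen word types.

    This is a generic coverage heuristic, not the authors' method.
    Returns the chosen document indices and the set of covered word types.
    """
    doc_types = [set(tokenize(doc)) for doc in corpus]
    covered = set()
    chosen = []
    remaining = set(range(len(corpus)))
    for _ in range(min(k, len(corpus))):
        # Pick the document contributing the most new types;
        # ties go to the lowest index because max keeps the first maximum.
        best = max(sorted(remaining), key=lambda i: len(doc_types[i] - covered))
        chosen.append(best)
        covered |= doc_types[best]
        remaining.remove(best)
    return chosen, covered


corpus = [
    "the cat sat",
    "the cat sat on the mat",
    "a dog barked loudly",
    "the the the",
]
chosen, covered = select_calibration_set(corpus, k=2)
print(chosen, len(covered))
```

Because the selection depends only on the text, no forward pass through the model is needed, which is the property that separates data-intrinsic selection from perplexity-based selection.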