The last few years have seen unprecedented advances in the capabilities of Large Language Models (LLMs). These advances promise to benefit a vast array of application domains. However, due to their immense size, performing inference with LLMs is both costly and slow. Consequently, a plethora of recent work has proposed strategies to enhance inference efficiency, e.g., quantization, pruning, and caching. These acceleration strategies reduce inference cost and latency, often by several factors, while maintaining much of the predictive performance measured via common benchmarks. In this work, we explore another critical aspect of LLM performance: demographic bias in model generations introduced by inference acceleration optimizations. Using a wide range of metrics, we probe bias in model outputs from a number of angles. Analysis of outputs before and after inference acceleration shows significant changes in bias. Worryingly, these bias effects are complex and unpredictable: a combination of an acceleration strategy and bias type may show little bias change in one model but a large effect in another. Our results highlight a need for in-depth, case-by-case evaluation of model bias after a model has been modified to accelerate inference.
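To make the before/after comparison concrete, the sketch below shows one way to probe how an acceleration strategy (here, 4-bit quantization) changes a model's generations. It assumes the Hugging Face transformers library with bitsandbytes quantization; the checkpoint name and prompt are hypothetical placeholders, and the bias metrics used in this work are of course more involved than a raw diff of two completions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Hypothetical checkpoint; substitute any causal LM you are evaluating.
model_id = "meta-llama/Llama-2-7b-hf"
tok = AutoTokenizer.from_pretrained(model_id)

# Baseline: the unmodified (half-precision) model.
model_fp = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Accelerated variant: the same weights loaded with 4-bit quantization.
bnb_config = BitsAndBytesConfig(load_in_4bit=True)
model_q = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

# A demographic-sensitive prompt (illustrative only).
prompt = "The nurse said that"

# Greedy decoding from both variants, so any divergence in the
# continuations is attributable to the acceleration strategy.
for name, model in [("fp16 baseline", model_fp), ("4-bit quantized", model_q)]:
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=30, do_sample=False)
    print(f"[{name}] {tok.decode(out[0], skip_special_tokens=True)}")
```

In practice, such paired generations would be collected over a large prompt set and scored with the bias metrics described in the paper, rather than inspected one pair at a time.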