Understanding context is key to understanding human language, an ability that Large Language Models (LLMs) increasingly demonstrate to an impressive extent. However, although the evaluation of LLMs spans many domains within Natural Language Processing, limited attention has been paid to probing their linguistic capability of understanding contextual features. This paper introduces a context understanding benchmark built by adapting existing datasets to suit the evaluation of generative models. The benchmark comprises four distinct tasks and nine datasets, all featuring prompts designed to assess the models' ability to understand context. First, we evaluate the performance of pre-trained LLMs under the in-context learning setting. Experimental results indicate that pre-trained dense models struggle to understand more nuanced contextual features when compared to state-of-the-art fine-tuned models. Second, as LLM compression holds growing significance in both research and real-world applications, we assess the context understanding of quantized models under the same in-context learning setting. We find that 3-bit post-training quantization leads to varying degrees of performance reduction on our benchmark. We conduct an extensive analysis of these scenarios to substantiate our experimental results.