Is Your LLM Outdated? A Deep Look at Temporal Generalization

The rapid advancement of Large Language Models (LLMs) has led to the development of benchmarks that consider temporal dynamics. However, because language and information are inherently dynamic, there remains a gap in understanding how well these models generalize across temporal contexts. This paper introduces the concept of temporal generalization in LLMs, including bias in past and future generalization. We then introduce FreshBench, a new evaluation framework that uses fresh text and event prediction to assess LLMs' temporal adaptability while keeping the evaluation process free from data leakage and subjective bias. Our experiments show significant temporal biases and a decline in performance over time. Our findings reveal that powerful models, while initially superior, tend to decline more rapidly in future generalization. Additionally, powerful open-source models demonstrate better long-term adaptability than their closed-source counterparts. Our code is available at https://github.com/FreedomIntelligence/FreshBench.

