TRACE: A Comprehensive Benchmark for Continual Learning in Large Language Models

Aligned large language models (LLMs) demonstrate exceptional capabilities in task solving, instruction following, and safety. However, the continual learning of these aligned LLMs has been largely overlooked. Existing continual learning benchmarks are not challenging enough for leading aligned LLMs, both because the tasks are simple and because the models may already have been exposed to them during instruction tuning. In this paper, we introduce TRACE, a novel benchmark designed to evaluate continual learning in LLMs. TRACE consists of 8 distinct datasets spanning challenging tasks, including domain-specific tasks, multilingual capabilities, code generation, and mathematical reasoning. All datasets are standardized into a unified format, allowing for straightforward automatic evaluation of LLMs. Our experiments show that after training on TRACE, aligned LLMs exhibit significant declines in both general ability and instruction-following capabilities. For example, the accuracy of llama2-chat 13B on the gsm8k dataset declined precipitously from 28.8% to 2% after training on our datasets. This highlights the challenge of finding a suitable tradeoff between achieving performance on specific tasks and preserving the original capabilities of LLMs. Empirical findings suggest that tasks inherently equipped with reasoning paths contribute significantly to preserving certain capabilities of LLMs against potential declines. Motivated by this, we introduce the Reasoning-augmented Continual Learning (RCL) approach. RCL integrates task-specific cues with meta-rationales, effectively reducing catastrophic forgetting in LLMs while expediting convergence on novel tasks.

