With the recent emergence of powerful instruction-tuned large language models (LLMs), helpful conversational Artificial Intelligence (AI) systems have been deployed across many applications. When prompted by users, these AI systems successfully perform a wide range of tasks as part of a conversation. To provide memory and context, such systems typically condition their output on the entire conversational history. Although this sensitivity to the conversational history can often improve performance on subsequent tasks, we find that performance can also be negatively impacted when there is a task-switch. To the best of our knowledge, our work is the first attempt to formalize the study of such vulnerabilities and of task interference in conversational LLMs caused by task-switches in the conversational history. Our experiments across 5 datasets with 15 task-switches using popular LLMs reveal that many of the task-switches can lead to significant performance degradation.