While there is abundant research evaluating ChatGPT on natural language understanding and generation tasks, few studies have investigated how ChatGPT's behavior changes over time. In this paper, we collect a coarse-to-fine temporal dataset called ChatLog, consisting of two parts that are updated monthly and daily. ChatLog-Monthly is a dataset of 38,730 question-answer pairs collected every month, with questions drawn from both reasoning and classification tasks. ChatLog-Daily, on the other hand, consists of ChatGPT's daily responses to the same 1,000 long-form generation questions. We conduct comprehensive automatic and human evaluation to provide evidence of ChatGPT's evolving patterns. We further analyze the characteristics of ChatGPT that remain unchanged over time by extracting its knowledge and linguistic features. We find some stable features that improve the robustness of a RoBERTa-based detector on new versions of ChatGPT. We will continuously maintain our project at https://github.com/THU-KEG/ChatLog.
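As a rough illustration of what a "stable feature" over daily snapshots could mean, the sketch below computes two simple linguistic features per day and reports how much each daily mean varies across days. It is a minimal sketch under stated assumptions, not the procedure used in the paper: the two features, the function names, and the input layout (a dict from date string to a list of response texts) are hypothetical and not taken from the ChatLog repository.

```python
from statistics import mean, pstdev


def avg_word_length(text: str) -> float:
    """Mean number of characters per whitespace-separated token."""
    words = text.split()
    return sum(len(w) for w in words) / len(words) if words else 0.0


def type_token_ratio(text: str) -> float:
    """Distinct tokens divided by total tokens (a crude lexical-diversity measure)."""
    words = text.lower().split()
    return len(set(words)) / len(words) if words else 0.0


# Hypothetical feature set chosen for illustration only.
FEATURES = {
    "avg_word_length": avg_word_length,
    "type_token_ratio": type_token_ratio,
}


def feature_stability(daily_responses: dict[str, list[str]]) -> dict[str, float]:
    """Return the coefficient of variation of each feature's daily mean.

    daily_responses maps a date string to that day's responses to the same
    fixed question set. Each day is assumed to have at least one response.
    Lower values indicate a feature that changes less from day to day.
    """
    stability = {}
    for name, fn in FEATURES.items():
        daily_means = [
            mean(fn(resp) for resp in responses)
            for _, responses in sorted(daily_responses.items())
        ]
        mu = mean(daily_means)
        stability[name] = pstdev(daily_means) / mu if mu else 0.0
    return stability
```

Features whose variation stays near zero across snapshots would be candidates for the kind of stable signal a detector could rely on; any threshold for "stable" would have to be chosen empirically.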