Many NLP researchers are experiencing an existential crisis triggered by the astonishing success of ChatGPT and other systems based on large language models (LLMs). After such a disruptive change to our understanding of the field, what is left to do? Taking a historical lens, we look for guidance from the first era of LLMs, which began in 2005 with large $n$-gram models for machine translation. We identify durable lessons from the first era, and more importantly, we identify evergreen problems where NLP researchers can continue to make meaningful contributions in areas where LLMs are ascendant. Among these lessons, we discuss the primacy of hardware advancement in shaping the availability and importance of scale, as well as the urgent challenge of quality evaluation, both automated and human. We argue that disparities in scale are transient and that researchers can work to reduce them; that data, rather than hardware, is still a bottleneck for many meaningful applications; that meaningful evaluation informed by actual use is still an open problem; and that there is still room for speculative approaches.