Large language models (LLMs) represent a new paradigm for processing unstructured data, with applications across an unprecedented range of domains. In this paper, we address, through two arguments, whether the development and application of LLMs would genuinely benefit from foundational contributions from the statistics discipline. First, we argue affirmatively, beginning with the observation that LLMs are inherently statistical models owing to their profound data dependency and stochastic generation processes, which makes statistical insight essential for handling their variability and uncertainty. Second, we argue that the persistent black-box nature of LLMs -- stemming from their immense scale, architectural complexity, and development practices that often prioritize empirical performance over theoretical interpretability -- renders closed-form or purely mechanistic analyses generally intractable, thereby necessitating statistical approaches for their flexibility and frequently demonstrated effectiveness. To substantiate these arguments, the paper outlines several research areas -- including alignment, watermarking, uncertainty quantification, evaluation, and data mixture optimization -- where statistical methodologies are critically needed and are already beginning to make valuable contributions. We conclude with a discussion suggesting that statistical research on LLMs will likely form a diverse ``mosaic'' of specialized topics rather than derive from a single unifying theory, and highlighting the importance of timely engagement by the statistics community in LLM research.