Large language models (LLMs) and generative AI have played a transformative role in computer science research and applications. Controversy has arisen over whether these models output copyrighted data, which can occur when the data they are trained on is copyrighted. LLMs are built on the transformer neural network architecture, which in turn relies on a mathematical computation called attention that uses the softmax function. In this paper, we show that training and optimizing a large language model can be viewed as a softmax regression problem. We then establish a method for performing softmax regression efficiently while preventing the regression function from generating copyrighted data. This yields a theoretical method for training large language models that avoids generating copyrighted data.
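To make the term "softmax regression" concrete, the following is a minimal sketch and not the paper's method. The abstract does not state the objective, so the form used here, a squared distance between softmax(Ax) and a target vector b, is an assumption chosen for illustration. The names `A`, `x`, `b`, `matvec`, and `softmax_regression_loss` are hypothetical, and the sketch does not include any copyright-protection mechanism.

```python
import math


def softmax(z):
    """Numerically stable softmax of a list of floats."""
    m = max(z)  # subtract the max so exp() cannot overflow
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(A, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def softmax_regression_loss(A, x, b):
    """Assumed illustrative objective: 0.5 * ||softmax(A x) - b||_2^2."""
    p = softmax(matvec(A, x))
    return 0.5 * sum((pi - bi) ** 2 for pi, bi in zip(p, b))


# Hypothetical toy data: 3 outputs, 2 parameters.
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [1.0 / 3, 1.0 / 3, 1.0 / 3]
loss_at_zero = softmax_regression_loss(A, [0.0, 0.0], b)
loss_elsewhere = softmax_regression_loss(A, [2.0, -1.0], b)
```

With `x` set to the zero vector, `A x` is all zeros, so the softmax output is uniform and matches the uniform target `b`; any other `x` that makes the entries of `A x` unequal gives a strictly larger loss.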