Scarcity of labeled training data remains the long pole in the tent for building performant language technology and generative AI models. Transformer models -- particularly LLMs -- are increasingly used to mitigate this data scarcity via synthetic data generation. However, because these models are black boxes, the properties of the synthetic data are difficult to predict. In practice, it is common for language technology engineers to 'fiddle' with the LLM temperature setting and hope that what comes out the other end improves the downstream model. Faced with this uncertainty, we propose the Data Kernel Perspective Space (DKPS) as a foundation for mathematical analysis that yields concrete statistical guarantees on the quality of transformer-model outputs. We first present the mathematical derivation of DKPS and show how it provides performance guarantees. Next, we show how these guarantees elucidate the performance of downstream models, such as neural machine translation systems or LLMs trained with Contrastive Preference Optimization (CPO). Limitations of the current work and directions for future research are also discussed.
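The derivation of DKPS appears later in the paper, not in this abstract; as an illustrative aid only, the sketch below shows one plausible construction consistent with the data-kernel idea named above: each model is represented by embeddings of its responses to a shared set of queries, pairwise distances between models are computed, and classical multidimensional scaling places each model as a point in a low-dimensional "perspective space." The function name dkps_embedding, the flattened-matrix representation, and all parameters are our own illustrative assumptions, not the paper's method.

```python
import numpy as np

def dkps_embedding(response_embeddings, d=2):
    """Hypothetical sketch of a DKPS-style construction (assumed, not the
    paper's derivation): compare models via embeddings of their responses
    to shared queries, then apply classical MDS.

    response_embeddings: array of shape (n_models, n_queries, emb_dim),
    e.g. sentence embeddings of each model's response to each query.
    """
    n_models = response_embeddings.shape[0]
    # Represent each model by the flattened matrix of its response embeddings.
    reps = response_embeddings.reshape(n_models, -1)

    # Pairwise Euclidean distances between model representations.
    D = np.linalg.norm(reps[:, None, :] - reps[None, :, :], axis=-1)

    # Classical multidimensional scaling: double-center the squared
    # distance matrix and keep the top-d eigenpairs.
    J = np.eye(n_models) - np.ones((n_models, n_models)) / n_models
    B = -0.5 * J @ (D ** 2) @ J
    eigvals, eigvecs = np.linalg.eigh(B)
    idx = np.argsort(eigvals)[::-1][:d]
    return eigvecs[:, idx] * np.sqrt(np.maximum(eigvals[idx], 0.0))

# Toy usage: 5 models, 20 shared queries, 8-dim response embeddings.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 20, 8))
coords = dkps_embedding(X, d=2)
print(coords.shape)  # (5, 2): one perspective-space point per model
```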