Large language models (LLMs) have demonstrated exceptional performance across a wide range of tasks and domains, and data preparation plays a critical role in achieving these results. Pre-training data typically combines corpora from multiple domains, and determining the optimal proportion of each domain is essential for maximizing performance. However, state-of-the-art (SOTA) LLMs rarely disclose details of their pre-training data, making it difficult for researchers to identify ideal data proportions. In this paper, we introduce a new topic, \textit{data proportion detection}, which automatically estimates pre-training data proportions by analyzing the generated outputs of LLMs. We provide rigorous theoretical proofs, practical algorithms, and preliminary experimental results for data proportion detection. Based on these findings, we offer valuable insights into the challenges and future directions for effective data proportion detection and data management.