Up-to-date, reliable Large Language Models (LLMs) are in constant demand. Typically, an LLM is trained on a fixed dataset and then deployed, so its training data steadily becomes outdated. Enabling automatic training on web data raises significant concerns about data quality and safety, because web text contains bias, spam, and other unsafe or unwanted content. Clean data is essential for producing reliable models; training a model on contaminated data may lead to undesirable outcomes. This research proposes a system that collects web data and automatically filters out unwanted text with the assistance of existing trusted AI models. In the experiment, a small sample of web data was collected and filtered, demonstrating the system's effectiveness in purifying the data.