LLM-AutoDP: Automatic Data Processing via LLM Agents for Model Fine-tuning

Large Language Models (LLMs) can be fine-tuned on domain-specific data to improve their performance in specialized fields. However, such data often contains many low-quality samples, which makes effective data processing (DP) necessary. In practice, DP strategies are typically developed through iterative manual analysis and trial-and-error adjustment. This process incurs high labor costs and, in privacy-sensitive domains such as healthcare, can raise privacy issues because humans access sensitive data directly. Automating data processing without exposing the raw data has therefore become a critical challenge. To address it, we propose LLM-AutoDP, a novel framework that uses LLMs as agents to automatically generate and optimize data processing strategies. Our method generates multiple candidate strategies and iteratively refines them using feedback signals and comparative evaluations. This iterative in-context learning mechanism enables the agent to converge toward high-quality processing pipelines without direct human intervention or access to the underlying data. To further accelerate strategy search, we introduce three key techniques: Distribution Preserving Sampling, which reduces data volume while maintaining distributional integrity; Processing Target Selection, which uses a binary classifier to identify low-quality samples for focused processing; and a Cache-and-Reuse Mechanism, which minimizes redundant computation by reusing prior processing results. Results show that models trained on data processed by our framework achieve win rates above 80% against models trained on unprocessed data. Compared to AutoML baselines based on LLM agents, LLM-AutoDP achieves a win rate of approximately 65%. Moreover, our acceleration techniques reduce the total search time by up to 10 times, demonstrating both effectiveness and efficiency.
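To make two of the acceleration ideas concrete, the following is a minimal sketch and not the authors' implementation. It assumes that "distribution preserving" can be approximated by stratified sampling over some grouping key, and that "cache and reuse" can be approximated by memoizing the output of each operator prefix of a strategy. All names here (`stratified_sample`, `CachedPipeline`, the `key` function, and the operator names) are hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(samples, key, fraction, seed=0):
    """Keep roughly `fraction` of every group so group proportions survive.

    `key` maps a sample to a group label (an assumed stand-in for whatever
    distribution the real framework preserves).
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for sample in samples:
        groups[key(sample)].append(sample)

    subset = []
    for members in groups.values():
        # Keep at least one sample per group so no group disappears.
        k = max(1, round(len(members) * fraction))
        subset.extend(rng.sample(members, k))
    return subset


class CachedPipeline:
    """Run a strategy (ordered list of operator names) on a text sample,
    reusing cached results for any previously processed strategy prefix."""

    def __init__(self, operators):
        self.operators = operators  # operator name -> function(str) -> str
        self.cache = {}             # (text, operator-name prefix) -> output
        self.calls = 0              # number of operator invocations so far

    def run(self, text, strategy):
        # Find the longest prefix of `strategy` already computed for `text`.
        start, current = 0, text
        for i in range(len(strategy), 0, -1):
            hit = self.cache.get((text, tuple(strategy[:i])))
            if hit is not None:
                start, current = i, hit
                break

        # Apply only the remaining operators, caching each new prefix.
        for i in range(start, len(strategy)):
            current = self.operators[strategy[i]](current)
            self.calls += 1
            self.cache[(text, tuple(strategy[:i + 1]))] = current
        return current
```

In this sketch, two candidate strategies that share their first operators only pay for the operators that differ, which is the kind of redundancy the abstract describes the Cache-and-Reuse Mechanism as removing. How the actual framework defines its cache keys and sampling distribution is not specified in the abstract.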
