Model customization requires high-quality and diverse datasets, but acquiring such data remains challenging and costly. Although large language models (LLMs) can synthesize training data, current approaches are constrained by limited seed data, model bias, and insufficient control over the generation process, resulting in limited diversity and biased distributions as data scale increases. To tackle this challenge, we present TreeSynth, a tree-guided, subspace-based data synthesis framework that recursively partitions the entire data space into hierarchical subspaces, enabling comprehensive and diverse scaling of data synthesis. Briefly, given a task-specific description, we construct a data space partitioning tree by iteratively executing criteria determination and subspace coverage steps. This hierarchically divides the whole space (i.e., the root node) into mutually exclusive and complementary atomic subspaces (i.e., leaf nodes). By collecting synthesized data according to the attributes of each leaf node, we obtain a diverse dataset that fully covers the data space. Empirically, our extensive experiments demonstrate that TreeSynth surpasses both human-designed datasets and state-of-the-art data synthesis baselines, achieving maximum improvements of 45.2% in data diversity and 17.6% in downstream task performance across various models and tasks. We hope TreeSynth offers a scalable solution for synthesizing diverse and comprehensive datasets from scratch without human intervention.
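The tree construction described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `propose_criterion` is a hypothetical stand-in for the LLM-driven "criteria determination" step (here hard-coded with example attributes for a math-problem task), and the per-value child expansion models the "subspace coverage" step, so sibling subspaces are mutually exclusive and jointly cover their parent.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """A node in the data space partitioning tree."""
    attributes: dict                      # attributes accumulated from root to here
    children: list = field(default_factory=list)

def propose_criterion(attributes):
    """Hypothetical stand-in for the LLM 'criteria determination' step:
    choose a new partitioning dimension and an exhaustive set of values.
    A real system would prompt an LLM with the task description and the
    attributes fixed so far; here we hard-code an example plan."""
    plan = {
        (): ("topic", ["arithmetic", "algebra", "geometry"]),
        ("topic",): ("difficulty", ["easy", "hard"]),
    }
    return plan.get(tuple(sorted(attributes)))

def build_tree(node, max_depth=2, depth=0):
    """Recursively split the data space until atomic subspaces remain."""
    if depth == max_depth:
        return
    proposal = propose_criterion(node.attributes)
    if proposal is None:
        return
    criterion, values = proposal
    # 'Subspace coverage': one child per value, so the children are
    # mutually exclusive and complementary within the parent subspace.
    for value in values:
        child = Node({**node.attributes, criterion: value})
        node.children.append(child)
        build_tree(child, max_depth, depth + 1)

def leaves(node):
    """Collect atomic subspaces; data would be synthesized per leaf's attributes."""
    if not node.children:
        return [node.attributes]
    return [leaf for child in node.children for leaf in leaves(child)]

root = Node({})          # the root node represents the whole data space
build_tree(root)
atomic_subspaces = leaves(root)
```

With this toy plan, the tree yields 3 topics x 2 difficulty levels = 6 leaf subspaces; in the framework, each leaf's attribute combination would then condition an LLM prompt to synthesize data covering that subspace.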