Recursive training on synthetic data can alleviate data scarcity but risks model collapse, in which repeated training erodes distributional tails and homogenizes outputs. Data selection is widely viewed as a remedy, yet its reliability depends critically on the reference distribution used by the verifier. We show that in low-resource verification regimes, where each verifier observes only a small, fragmented, and biased slice of the target manifold, selection itself becomes biased. This situation arises naturally in data silos such as healthcare consortia or proprietary financial institutions, where raw data cannot be pooled and local references are inherently incomplete. As a result, selection preferentially retains samples aligned with the local manifold while pruning globally relevant tail modes, which turns selection from a safeguard against collapse into a mechanism that precipitates it. We prove that such siloed selection accelerates collapse and induces power-law diversity decay. As an initial mitigation, we construct Wasserstein proxy references from multiple silos without sharing raw data. Empirical results confirm that local-reference selection fails on skewed distributions, whereas collaborative proxy references mitigate diversity degradation. These findings suggest that recursive synthetic-data pipelines require particular caution when real-data coverage is fragmented or scarce.
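The selection bias described above can be illustrated with a toy sketch. It is not the paper's method: the one-dimensional two-mode target, the nearest-neighbor threshold rule, the threshold `tau`, and every function name below are assumptions made purely for illustration. In particular, `broad_ref` is an idealised reference that happens to cover both modes; it stands in for better coverage and is not the Wasserstein proxy construction.

```python
import bisect
import random


def sample_target(n, rng, tail_weight=0.1):
    """Draw n points from a toy mixture: main mode N(0, 1), rare tail mode N(6, 1)."""
    return [
        rng.gauss(6.0, 1.0) if rng.random() < tail_weight else rng.gauss(0.0, 1.0)
        for _ in range(n)
    ]


def nearest_distance(x, sorted_ref):
    """Distance from x to its nearest neighbor in a sorted, non-empty reference."""
    i = bisect.bisect_left(sorted_ref, x)
    neighbors = sorted_ref[max(i - 1, 0): i + 1]
    return min(abs(x - r) for r in neighbors)


def select(candidates, reference, tau=0.5):
    """Keep candidates lying within tau of some reference point (hypothetical verifier)."""
    sorted_ref = sorted(reference)
    return [x for x in candidates if nearest_distance(x, sorted_ref) <= tau]


def tail_fraction(samples, cut=3.0):
    """Share of samples above `cut`, used here as a crude proxy for tail coverage."""
    return sum(x > cut for x in samples) / len(samples)


rng = random.Random(0)
candidates = sample_target(5000, rng)                   # synthetic samples to be filtered
local_ref = [rng.gauss(0.0, 1.0) for _ in range(200)]   # silo that only sees the main mode
broad_ref = sample_target(200, rng)                     # idealised reference covering both modes

kept_local = select(candidates, local_ref)
kept_broad = select(candidates, broad_ref)

print("tail share before selection:      ", round(tail_fraction(candidates), 4))
print("tail share, local reference:      ", round(tail_fraction(kept_local), 4))
print("tail share, broad-coverage reference:", round(tail_fraction(kept_broad), 4))
```

In this sketch the local reference discards most of the tail mode after a single round of filtering, while the broad-coverage reference retains far more of it. Repeating the filter across generations would compound the loss, which is the qualitative effect the abstract describes.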