We study data distillation for auto-regressive machine learning tasks, where the input and output have a strict left-to-right causal structure. More specifically, we propose Farzi, which summarizes an event sequence dataset into a small number of synthetic sequences -- Farzi Data -- that are optimized to maintain (if not improve) model performance compared to training on the full dataset. Under the hood, Farzi conducts memory-efficient data distillation by (i) deriving an efficient reverse-mode differentiation of the Adam optimizer that leverages Hessian-Vector Products; and (ii) factorizing the high-dimensional discrete event space into a latent space, which provably promotes implicit regularization. Empirically, for sequential recommendation and language modeling tasks, we achieve 98-120% of downstream full-data performance when training state-of-the-art models on Farzi Data as small as 0.1% of the original dataset. Notably, being able to train better models with significantly less data sheds light on the design of future large auto-regressive models, and opens up new opportunities to further scale up model and data sizes.
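To make point (ii) concrete, here is a minimal sketch of what factorizing the discrete event space into a latent space could look like. The abstract does not specify the parameterization, so everything below is an assumption made for illustration: a latent summary of shape [sequences, positions, latent_dim] is decoded through one shared latent-to-event matrix followed by a softmax, and all names and sizes (`decode`, `NUM_SEQS`, `LATENT_DIM`, and so on) are hypothetical. The sketch only compares how many free parameters each representation would need; it does not perform any distillation.

```python
import math
import random


def dense_param_count(num_seqs, seq_len, vocab_size):
    """Free parameters when every (sequence, position, event) gets its own logit."""
    return num_seqs * seq_len * vocab_size


def factorized_param_count(num_seqs, seq_len, vocab_size, latent_dim):
    """Free parameters for a latent summary plus one shared latent-to-event decoder."""
    return num_seqs * seq_len * latent_dim + latent_dim * vocab_size


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(latent, decoder):
    """Map latent[seq][pos][k] through decoder[k][event] to per-position event distributions."""
    vocab_size = len(decoder[0])
    return [
        [
            softmax([
                sum(z * decoder[k][v] for k, z in enumerate(pos_vec))
                for v in range(vocab_size)
            ])
            for pos_vec in seq
        ]
        for seq in latent
    ]


# Hypothetical sizes, chosen only for illustration (not taken from the paper).
NUM_SEQS, SEQ_LEN, VOCAB_SIZE, LATENT_DIM = 10, 20, 50_000, 8

dense = dense_param_count(NUM_SEQS, SEQ_LEN, VOCAB_SIZE)
factored = factorized_param_count(NUM_SEQS, SEQ_LEN, VOCAB_SIZE, LATENT_DIM)
print(f"dense: {dense:,} parameters, factorized: {factored:,} parameters")

# Tiny decode demo: 2 sequences, 3 positions, latent_dim 2, 5 possible events.
rng = random.Random(0)
tiny_latent = [[[rng.gauss(0, 1) for _ in range(2)] for _ in range(3)] for _ in range(2)]
tiny_decoder = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(2)]
tiny_data = decode(tiny_latent, tiny_decoder)
```

Under these assumed sizes the dense representation needs 10,000,000 parameters and the factorized one needs 401,600, since the decoder cost is paid once rather than per position. Each decoded position is a probability distribution over events, so the synthetic sequences stay differentiable with respect to the latent summary.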