Traditional language models, adept at next-token prediction in text sequences, often struggle with transduction tasks between distinct symbolic systems, particularly when parallel data is scarce. To address this, we introduce \textit{symbolic autoencoding} ($\Sigma$AE), a self-supervised framework that leverages abundant non-parallel data alongside limited parallel data. $\Sigma$AE connects two generative models via a discrete bottleneck layer and is optimized end-to-end by minimizing a reconstruction loss jointly with a supervised loss on the parallel data, so that the sequence generated at the discrete bottleneck can be read out as the transduction of the input sequence. We also develop gradient-based methods that enable efficient self-supervised sequence learning despite the discreteness of the bottleneck. Our results demonstrate that $\Sigma$AE significantly improves performance on transduction tasks, even with minimal parallel data, offering a promising solution for weakly supervised learning scenarios.
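One way to write the joint objective described above is sketched here. The notation is assumed purely for illustration and none of these symbols appear in the text: $M_{xz}$ and $M_{zx}$ stand for the two generative models, $\mathcal{D}_{x}$ for the non-parallel data, $\mathcal{D}_{xz}$ for the parallel data, $\ell_{\mathrm{rec}}$ and $\ell_{\mathrm{sup}}$ for sequence-level losses, and $\lambda$ for a hypothetical weighting coefficient.

\begin{equation}
\mathcal{L}
= \mathbb{E}_{x \sim \mathcal{D}_{x}}
  \Big[ \ell_{\mathrm{rec}}\big(x,\; M_{zx}(\hat{z})\big) \Big]
+ \lambda \, \mathbb{E}_{(x, z) \sim \mathcal{D}_{xz}}
  \Big[ \ell_{\mathrm{sup}}\big(z,\; M_{xz}(x)\big) \Big],
\qquad
\hat{z} = \mathrm{discretize}\big(M_{xz}(x)\big).
\end{equation}

Here $\hat{z}$ is the discrete sequence produced at the bottleneck, which is the quantity read out as the transduction of $x$. An analogous reconstruction term could be added in the opposite direction when non-parallel data from the second symbolic system is available. Because $\hat{z}$ is discrete, the first term is not differentiable with respect to the parameters of $M_{xz}$ as written, which is why gradient-based methods tailored to the discrete bottleneck are needed.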