Symbolic mathematical equations are indispensable for modeling complex natural phenomena, and scientific inquiry often involves collecting observations and translating them into mathematical expressions. Recently, deep learning has emerged as a powerful tool for extracting insights from data. However, existing models typically specialize in either the numeric or the symbolic domain, and are usually trained in a supervised manner tailored to specific tasks. This approach neglects the substantial benefits that could arise from a task-agnostic, unified understanding of symbolic equations and their numeric counterparts. To bridge this gap, we introduce SNIP, a Symbolic-Numeric Integrated Pre-training model, which employs joint contrastive learning between the symbolic and numeric domains, enhancing their mutual similarities in the pre-trained embeddings. Through latent space analysis, we observe that SNIP provides cross-domain insights into the representations, revealing that symbolic supervision enhances the embeddings of numeric data and vice versa. We evaluate SNIP across diverse tasks, including symbolic-to-numeric mathematical property prediction and numeric-to-symbolic equation discovery, commonly known as symbolic regression. Results show that SNIP transfers effectively to various tasks, consistently outperforming fully supervised baselines and competing strongly with established task-specific methods, especially in few-shot learning scenarios where available data is limited.
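As an illustration of the joint contrastive learning mentioned above, the following is a minimal sketch of a symmetric, CLIP-style contrastive (InfoNCE) objective over paired symbolic and numeric embeddings. The abstract does not specify the exact loss, so this form is an assumption, as are the function names and the temperature value; the encoders that would produce the embeddings are not shown.

```python
import numpy as np


def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(symbolic_emb, numeric_emb, temperature=0.07):
    """Hypothetical symmetric InfoNCE loss (assumed form, not taken from the paper).

    Row i of `symbolic_emb` and row i of `numeric_emb` are assumed to come
    from the same equation; every other row in the batch acts as a negative.
    """
    # L2-normalize so that dot products are cosine similarities.
    s = symbolic_emb / np.linalg.norm(symbolic_emb, axis=1, keepdims=True)
    n = numeric_emb / np.linalg.norm(numeric_emb, axis=1, keepdims=True)

    # Pairwise similarity matrix, scaled by the (assumed) temperature.
    logits = s @ n.T / temperature
    idx = np.arange(logits.shape[0])

    # Symbolic-to-numeric: each row should select its own column.
    loss_s2n = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Numeric-to-symbolic: each column should select its own row.
    loss_n2s = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_s2n + loss_n2s)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(8, 16))          # stand-in for encoder outputs
    noisy = emb + 0.1 * rng.normal(size=emb.shape)
    print("matched pairs:   ", symmetric_contrastive_loss(emb, noisy))
    print("mismatched pairs:", symmetric_contrastive_loss(emb, noisy[::-1]))
```

Minimizing a loss of this kind pulls the two embeddings of the same equation together and pushes apart embeddings of different equations in the batch, which is one way to obtain the mutual similarity between domains that the abstract describes.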