Tabular foundation models for in-context prediction of molecular properties

Accurate molecular property prediction is central to drug discovery, catalysis, and process design, yet real-world applications are often limited by small datasets. Molecular foundation models offer a promising direction by learning transferable molecular representations; however, they typically require task-specific fine-tuning and machine learning expertise, and they often fail to outperform classical baselines. Tabular foundation models (TFMs) offer a fundamentally different paradigm: they make predictions through in-context learning, enabling inference without task-specific training. Here, we evaluate TFMs in the low- to medium-data regime on both standardized pharmaceutical benchmarks and chemical engineering datasets. As inputs, we consider frozen molecular foundation model representations as well as classical descriptors and fingerprints. Across the benchmarks, the approach achieves excellent predictive performance at lower computational cost than fine-tuning, and these advantages carry over to practical engineering data settings. In particular, combining TFMs with CheMeleon embeddings yields win rates of up to 100% on 30 MoleculeACE tasks, while compact RDKit2d and Mordred descriptors provide strong descriptor-based alternatives. Molecular representation emerges as a key determinant of TFM performance, with molecular foundation model embeddings and 2D descriptor sets both providing substantial gains over classical molecular fingerprints on many tasks. These results suggest that in-context learning with TFMs provides a highly accurate and cost-efficient alternative for property prediction in practical applications.
