Due in part to their discontinuous and discrete default encodings for numbers, Large Language Models (LLMs) have not yet been commonly used to process numerically dense scientific datasets. Rendering datasets as text, however, could help aggregate diverse and multi-modal scientific data into a single training corpus, thereby potentially facilitating the development of foundation models for science. In this work, we introduce xVal, a strategy for continuously tokenizing numbers within language models that results in a more appropriate inductive bias for scientific applications. By training specially modified language models from scratch on a variety of scientific datasets formatted as text, we find that xVal generally outperforms other common numerical tokenization strategies on metrics including out-of-distribution generalization and computational efficiency.
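The abstract describes continuous number tokenization only at a high level. As an illustration, the sketch below shows one way such a scheme can be realized, assuming that every literal number in the text is replaced by a shared [NUM] placeholder token whose embedding is then scaled by the numeric value. This is a minimal sketch under that assumption, not necessarily the paper's exact implementation; the names (extract_numbers, ContinuousNumberEmbedding, the [NUM] placeholder) are hypothetical.

```python
import re

import torch
import torch.nn as nn

# Matches integers, decimals, and scientific notation, e.g. 3, 3.14, -1.5e2.
NUM_PATTERN = re.compile(r"[-+]?\d*\.?\d+(?:[eE][-+]?\d+)?")


def extract_numbers(text: str):
    """Replace each literal number with the placeholder '[NUM]' and
    return the rewritten text plus the list of extracted values."""
    values = [float(m.group()) for m in NUM_PATTERN.finditer(text)]
    return NUM_PATTERN.sub("[NUM]", text), values


class ContinuousNumberEmbedding(nn.Module):
    """Standard token embedding, except that the [NUM] embedding is
    multiplied elementwise by the numeric value at that position,
    giving the model a continuous representation of magnitude."""

    def __init__(self, vocab_size: int, d_model: int, num_token_id: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.num_token_id = num_token_id

    def forward(self, token_ids: torch.Tensor, num_values: torch.Tensor):
        # token_ids:  (batch, seq) integer token ids.
        # num_values: (batch, seq) floats; entries at non-number
        #             positions are ignored and replaced by 1.0.
        h = self.embed(token_ids)
        scale = torch.where(token_ids == self.num_token_id,
                            num_values, torch.ones_like(num_values))
        return h * scale.unsqueeze(-1)


# Example of the preprocessing step:
text, vals = extract_numbers("mass = 3.2, velocity = -1.5e2")
# text -> "mass = [NUM], velocity = [NUM]"; vals -> [3.2, -150.0]
```

Because the embedding varies smoothly with the value, nearby numbers receive nearby representations, which is one way to obtain the continuous inductive bias the abstract alludes to; in practice the values would typically also be normalized to a bounded range before scaling.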