Large Language Models have not yet been broadly adapted for the analysis of scientific datasets due in part to the unique difficulties of tokenizing numbers. We propose xVal, a numerical encoding scheme that represents any real number using just a single token. xVal represents a given real number by scaling a dedicated embedding vector by the number value. Combined with a modified number-inference approach, this strategy renders the model end-to-end continuous when considered as a map from the numbers of the input string to those of the output string. This leads to an inductive bias that is generally more suitable for applications in scientific domains. We empirically evaluate our proposal on a number of synthetic and real-world datasets. Compared with existing number encoding schemes, we find that xVal is more token-efficient and demonstrates improved generalization.
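The encoding step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the placeholder token name `[NUM]`, the regular expression used to find numeric literals, the whitespace tokenizer, and the dictionary-based embedding table are all assumptions made for illustration. The sketch shows only the input side (one token per number, with a shared embedding vector scaled by the number's value); the modified number-inference approach on the output side is not covered.

```python
import re
import numpy as np

# Hypothetical placeholder token; the name is an assumption for this sketch.
NUM_TOKEN = "[NUM]"

# Simple pattern for integer, decimal, and exponent-form literals (an assumption;
# it will also match digits embedded in words such as "H2O").
NUMBER_RE = re.compile(r"[-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?")


def split_numbers(text):
    """Replace every numeric literal with NUM_TOKEN.

    Returns the whitespace-split token list and the list of extracted
    values, in order of appearance.
    """
    values = [float(m) for m in NUMBER_RE.findall(text)]
    tokens = NUMBER_RE.sub(f" {NUM_TOKEN} ", text).split()
    return tokens, values


def embed(tokens, values, table):
    """Look up one embedding per token.

    Every NUM_TOKEN position uses the same dedicated vector, multiplied by
    the numeric value found at that position, so the embedding varies
    continuously with the number.
    """
    remaining = iter(values)
    rows = []
    for tok in tokens:
        vec = table[tok]
        if tok == NUM_TOKEN:
            vec = next(remaining) * vec
        rows.append(vec)
    return np.stack(rows)
```

For example, `split_numbers("mass 2.5 kg at 300 K")` yields six tokens, two of which are `[NUM]`, together with the values `[2.5, 300.0]`. Passing those to `embed` with any table that maps each token to a vector gives rows 1 and 4 equal to 2.5 and 300 times the shared `[NUM]` vector, respectively. Because the value enters only as a multiplicative factor, a small change in an input number produces a proportionally small change in its embedding, which is the continuity property the abstract refers to.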