Large language models (LLMs) such as OpenAI's ChatGPT and GPT-3 offer unique testbeds for exploring the translation challenges of turning literacy into numeracy. Publicly available transformer models released eighteen months earlier, and 1000 times smaller, failed at basic arithmetic. The statistical analysis of four complex datasets described here combines arithmetic manipulations that cannot be memorized or encoded by simple rules. The work examines whether next-token prediction extends beyond sentence completion into actual numerical understanding. For example, the work highlights descriptive statistics on in-memory datasets that the LLM either loads from memory or generates randomly using Python libraries. The resulting exploratory data analysis showcases the model's ability to group by or pivot categorical sums, infer feature importance, derive correlations, and predict unseen test cases using linear regression. To extend the model's testable range, the research deletes and appends random rows so that recall alone cannot explain emergent numeracy.
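The kind of ground truth against which such LLM answers could be checked can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used in the work: the toy dataset, the column names (`group`, `x1`, `x2`, `y`), the helper `make_rows`, and the choice of pandas and NumPy are all hypothetical. It deletes and appends random rows, then computes categorical sums, correlations, and a least-squares linear fit on the perturbed table.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)


def make_rows(rng, n):
    """Generate n rows of a hypothetical toy dataset (illustrative only)."""
    rows = pd.DataFrame({
        "group": rng.choice(["a", "b", "c"], size=n),
        "x1": rng.normal(size=n),
        "x2": rng.normal(size=n),
    })
    # Assumed linear relationship plus small noise, chosen for illustration.
    rows["y"] = 3.0 * rows["x1"] - 2.0 * rows["x2"] + rng.normal(scale=0.1, size=n)
    return rows


df = make_rows(rng, 200)

# Perturb the table: drop 20 random rows and append 20 new random rows,
# so the exact table cannot have been seen before.
drop_idx = rng.choice(df.index, size=20, replace=False)
perturbed = pd.concat([df.drop(index=drop_idx), make_rows(rng, 20)],
                      ignore_index=True)

# Reference answers for group-by sums, a pivot, and pairwise correlations.
group_sums = perturbed.groupby("group")["y"].sum()
pivot = pd.pivot_table(perturbed, index="group", values=["x1", "x2", "y"],
                       aggfunc="sum")
corr = perturbed[["x1", "x2", "y"]].corr()

# Ordinary least squares with an intercept column.
X = np.column_stack([np.ones(len(perturbed)),
                     perturbed[["x1", "x2"]].to_numpy()])
coef, *_ = np.linalg.lstsq(X, perturbed["y"].to_numpy(), rcond=None)

# Predict an unseen test case with x1 = 1.0 and x2 = -1.0.
prediction = float(np.array([1.0, 1.0, -1.0]) @ coef)
```

Comparing a model's reported sums, correlations, or predictions with `group_sums`, `corr`, and `prediction` on the perturbed table is one way to separate computation from recall, since the randomized rows make memorized answers unlikely to match.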