We explore generating factual and accurate tables from the parametric knowledge of large language models (LLMs). While LLMs have demonstrated impressive capabilities in recreating knowledge bases and generating free-form text, we focus on generating structured tabular data, which is crucial in domains like finance and healthcare. We examine the table generation abilities of four state-of-the-art LLMs: GPT-3.5, GPT-4, Llama2-13B, and Llama2-70B, using three prompting methods for table generation: (a) full-table, (b) row-by-row, and (c) cell-by-cell. For evaluation, we introduce a novel benchmark, WikiTabGen, which contains 100 curated Wikipedia tables. Tables are further processed to ensure their factual correctness and manually annotated with short natural language descriptions. Our findings reveal that table generation remains a challenge, with GPT-4 reaching the highest accuracy at 19.6%. Our detailed analysis sheds light on how various table properties, such as size, table popularity, and numerical content, influence generation performance. This work highlights the unique challenges in LLM-based table generation and provides a solid evaluation framework for future research. Our code, prompts, and data are all publicly available: https://github.com/analysis-bots/WikiTabGen