Recent advances in large language models have revolutionized many sectors, including the database industry. One common challenge when dealing with large volumes of tabular data is the pervasive use of abbreviated column names, which can degrade performance on various data search, access, and understanding tasks. To address this issue, we introduce a new task, called NameGuess, which frames the expansion of abbreviated column names (as used in database schemas) as a natural language generation problem. We create a training dataset of 384K abbreviated-expanded column pairs using a new data fabrication method, along with a human-annotated evaluation benchmark of 9.2K examples from real-world tables. To tackle the polysemy and ambiguity inherent in NameGuess, we enhance auto-regressive language models by conditioning them on table content and column header names, yielding a fine-tuned model (with 2.7B parameters) that matches human performance. Furthermore, we conduct a comprehensive analysis across multiple LLMs to validate the effectiveness of table content in NameGuess and to identify promising future opportunities. Code is available at https://github.com/amazon-science/nameguess.
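To make the idea of conditioning on table content and column header names more concrete, the following is a minimal sketch of how a table might be serialized into a text prompt for an auto-regressive model. It is an illustration under stated assumptions, not the prompt format used in the NameGuess code: the function name `build_prompt`, the `max_rows` parameter, the pipe-separated layout, and the example column names are all hypothetical.

```python
from typing import Dict, List


def build_prompt(
    table: Dict[str, List[object]],
    query_columns: List[str],
    max_rows: int = 3,
) -> str:
    """Serialize sample rows and abbreviated headers into one text prompt.

    Hypothetical format: the layout below is an assumption for illustration,
    not the serialization used by the NameGuess authors.
    """
    headers = list(table.keys())
    # Use at most max_rows rows, bounded by the shortest column.
    num_rows = min(max_rows, min((len(v) for v in table.values()), default=0))

    lines = ["Table:", " | ".join(headers)]
    for i in range(num_rows):
        # Cell values give the model context to disambiguate abbreviations.
        lines.append(" | ".join(str(table[h][i]) for h in headers))

    lines.append("Expand these abbreviated column names: " + ", ".join(query_columns))
    lines.append("Expanded names:")
    return "\n".join(lines)


# Hypothetical example table with abbreviated headers.
example_table = {
    "cust_nm": ["Alice", "Bob"],
    "ord_dt": ["2021-01-03", "2021-02-14"],
}
example_prompt = build_prompt(example_table, ["cust_nm", "ord_dt"])
```

The resulting string could be passed to any text-generation model. Including a few cell values alongside the headers is what lets a model distinguish, for instance, a date-like column from a name-like one when an abbreviation is ambiguous.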