Generating Data for Symbolic Language with Large Language Models - 专知论文

会员服务 ·

0

语言模型化 · MoDELS · HTTPS · 推断 · Extensibility ·

2023 年 5 月 23 日

Generating Data for Symbolic Language with Large Language Models

翻译：使用大语言模型生成符号语言数据

Jiacheng Ye,Chengzu Li,Lingpeng Kong,Tao Yu

While large language models (LLMs) bring not only performance but also complexity, recent work has started to turn LLMs into data generators rather than task inferencers, where another affordable task model is trained for efficient deployment and inference. However, such an approach has primarily been applied to natural language tasks and has not yet been explored for symbolic language tasks with complex structured outputs (e.g., semantic parsing and code generation). In this paper, we propose SymGen which utilizes LLMs for generating various annotation-expensive symbolic language data. SymGen consists of an informative prompt to steer generation and an agreement-based verifier to improve data correctness. We conduct extensive experiments on six symbolic language tasks across various settings. Compared with the LLMs, we demonstrate the 1\%-sized task model can achieve comparable or better performance, largely cutting inference and deployment costs. We also show that generated data with only a few human demonstrations can be as effective as over 10 times the amount of human-annotated data when training the task model, saving a considerable amount of annotation effort. SymGen sheds new light on data generation for complex tasks, and we release the code at \href{https://github.com/HKUNLP/SymGen}{https://github.com/HKUNLP/SymGen}.

翻译：尽管大语言模型（LLMs）在带来性能提升的同时也引入了复杂性，近期研究开始将LLM用作数据生成器而非任务推理器，通过训练另一个成本较低的任务模型来实现高效的部署与推理。然而，此类方法目前主要应用于自然语言任务，尚未在具有复杂结构化输出的符号语言任务（如语义解析、代码生成）中展开探索。本文提出SymGen方法，利用LLM生成各类标注成本高昂的符号语言数据。SymGen包含引导生成的信息性提示和基于验证机制的共识校验器以提高数据正确性。我们在六种不同场景下的符号语言任务上开展了广泛实验。结果表明，与LLM相比，仅为其1%量级的任务模型即可达到相近或更优性能，大幅降低了推理与部署成本。同时我们发现，在训练任务模型时，仅使用少量人工演示数据生成的样本，其效果可与超过十倍规模的人工标注数据相媲美，显著节约了人工标注成本。SymGen为复杂任务的数据生成开辟了新路径，相关代码已开源至\href{https://github.com/HKUNLP/SymGen}{https://github.com/HKUNLP/SymGen}。

0

相关内容

语言模型化

语言模型化

百篇论文纵览大型语言模型最新研究进展

百篇论文纵览大型语言模型最新研究进展

专知会员服务

70+阅读 · 2023年3月31日

【干货书】深度学习合成数据，354页pdf，Synthetic Data for Deep Learning

【干货书】深度学习合成数据，354页pdf，Synthetic Data for Deep Learning

专知会员服务

105+阅读 · 2022年2月10日

【DeepMind】强化学习教程，83页ppt

【DeepMind】强化学习教程，83页ppt

专知会员服务

160+阅读 · 2020年8月7日

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

专知会员服务

167+阅读 · 2020年3月18日

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

60+阅读 · 2019年10月17日

强化学习最新教程，17页pdf

强化学习最新教程，17页pdf

专知会员服务

182+阅读 · 2019年10月11日

[综述]深度学习下的场景文本检测与识别

[综述]深度学习下的场景文本检测与识别

专知会员服务

78+阅读 · 2019年10月10日

机器学习入门的经验与建议

机器学习入门的经验与建议

专知会员服务

94+阅读 · 2019年10月10日

【加州大学伯克利分校博士论文】通过自我监督预测学习泛化

【加州大学伯克利分校博士论文】通过自我监督预测学习泛化

专知会员服务

65+阅读 · 2019年10月9日

【SIGGRAPH2019】TensorFlow 2.0深度学习计算机图形学应用

【SIGGRAPH2019】TensorFlow 2.0深度学习计算机图形学应用

专知会员服务

41+阅读 · 2019年10月9日

强化学习三篇论文避免遗忘等

强化学习三篇论文避免遗忘等

CreateAMind

20+阅读 · 2019年5月24日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

逆强化学习-学习人先验的动机

逆强化学习-学习人先验的动机

CreateAMind

16+阅读 · 2019年1月18日

强化学习的Unsupervised Meta-Learning

强化学习的Unsupervised Meta-Learning

CreateAMind

18+阅读 · 2019年1月7日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

44+阅读 · 2019年1月3日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

vae 相关论文表示学习 1

vae 相关论文表示学习 1

CreateAMind

12+阅读 · 2018年9月6日

【代码资源】GAN | 七份最热GAN文章及代码分享（Github 1000+Stars）

【代码资源】GAN | 七份最热GAN文章及代码分享（Github 1000+Stars）

专知

13+阅读 · 2018年6月24日

强化学习族谱

强化学习族谱

CreateAMind

26+阅读 · 2017年8月2日

Volterra积分微分方程的多区间Chebyshev和Legendre谱配置法

国家自然科学基金

0+阅读 · 2015年12月31日

拟人化hERG基因突变及敲除的糖尿病动物模型的建立

国家自然科学基金

0+阅读 · 2014年12月31日

多元多项式环的Hermite性质与多项式矩阵的分解

国家自然科学基金

0+阅读 · 2014年12月31日

刚性微分方程高阶隐式离散解的快速迭代算法

国家自然科学基金

0+阅读 · 2013年12月31日

磁化氢气脉冲放电的PIC/MC/DSMC模拟研究

国家自然科学基金

0+阅读 · 2012年12月31日

非线性椭圆型偏微分方程的边界正则性

国家自然科学基金

0+阅读 · 2012年12月31日

BdDUOX和BdRelish在橘小实蝇肠道微生物群落稳态维持中的作用机理

国家自然科学基金

0+阅读 · 2012年12月31日

家蚕贮藏蛋白Arylphorin结构解析及分子降解机制研究

国家自然科学基金

0+阅读 · 2012年12月31日

自然环境与目标电磁散射和极化SAR遥感信息理论

国家自然科学基金

0+阅读 · 2009年12月31日

Blazars和微类星体的多波段观测与研究

国家自然科学基金

0+阅读 · 2008年12月31日

TIM: Teaching Large Language Models to Translate with Comparison

Arxiv

0+阅读 · 2023年7月10日

Large Language Models for Supply Chain Optimization

Arxiv

0+阅读 · 2023年7月8日

Evaluating Open-Domain Question Answering in the Era of Large Language Models

Arxiv

0+阅读 · 2023年7月6日

A Survey on Evaluation of Large Language Models

Arxiv

0+阅读 · 2023年7月6日

Style Over Substance: Evaluation Biases for Large Language Models

Arxiv

0+阅读 · 2023年7月6日

A Survey on Multimodal Large Language Models

Arxiv

25+阅读 · 2023年6月23日

A Survey on Large Language Models for Recommendation

Arxiv

12+阅读 · 2023年5月31日

Augmented Large Language Models with Parametric Knowledge Guiding

Arxiv

20+阅读 · 2023年5月8日

Conditional Prompt Learning for Vision-Language Models

Conditional Prompt Learning for Vision-Language Models

Arxiv

13+阅读 · 2022年3月10日

Learning Neural Models for Natural Language Processing in the Face of Distributional Shift

Arxiv

11+阅读 · 2021年9月3日

VIP会员

文章信息

相关主题

语言模型化

最新内容

“天降毒雾”：无人机如何使化学战重返乌克兰战场

“天降毒雾”：无人机如何使化学战重返乌克兰战场

专知会员服务

0+阅读 · 今天11:28

伊朗不对称防空战略的演进

伊朗不对称防空战略的演进

专知会员服务

1+阅读 · 今天11:10

对抗环境下超视距目标打击的情报支援

对抗环境下超视距目标打击的情报支援

专知会员服务

10+阅读 · 7月22日

《面向复杂地形下无人机跟踪地面机器人（UAV–UGV）的自适应多滤波器扩展卡尔曼滤波框架》

《面向复杂地形下无人机跟踪地面机器人（UAV–UGV）的自适应多滤波器扩展卡尔曼滤波框架》

专知会员服务

4+阅读 · 7月22日

纵深侦察：大规模作战行动中远程侦察与监视之迫切需求

纵深侦察：大规模作战行动中远程侦察与监视之迫切需求

专知会员服务

8+阅读 · 7月22日

共享认知，分布式研判：复杂行动中的美国空军指挥控制（万字长文）

共享认知，分布式研判：复杂行动中的美国空军指挥控制（万字长文）

专知会员服务

8+阅读 · 7月22日

《无人机对海面作战影响评估》

《无人机对海面作战影响评估》

专知会员服务

15+阅读 · 7月21日

《可损耗无人系统规模化应用对美国军事转型的战略影响（2022-2030）》2026年270页

《可损耗无人系统规模化应用对美国军事转型的战略影响（2022-2030）》2026年270页

专知会员服务

13+阅读 · 7月21日

博士论文 | 后训练如何损害大模型生成多样性？SimpleStrat与Stylus

博士论文 | 后训练如何损害大模型生成多样性？SimpleStrat与Stylus

专知会员服务

4+阅读 · 7月21日

综述 | 面向5G/6G网络的LLM智能体AI：架构、协议与标准化

综述 | 面向5G/6G网络的LLM智能体AI：架构、协议与标准化

专知会员服务

6+阅读 · 7月21日

五角大楼新设无人机办公室（DRPM-UxS）将如何重塑美国无人系统格局（附美国防部设立备忘录）

五角大楼新设无人机办公室（DRPM-UxS）将如何重塑美国无人系统格局（附美国防部设立备忘录）

专知会员服务

9+阅读 · 7月21日

印度精确打击与指挥架构的断层

印度精确打击与指挥架构的断层

专知会员服务

7+阅读 · 7月20日

《NASA喷气推进实验室：高耐久轻质常驻空观测系统（HELIOS）》429页

《NASA喷气推进实验室：高耐久轻质常驻空观测系统（HELIOS）》429页

专知会员服务

9+阅读 · 7月20日

美空军AI完成F-16战斗机自主空战历史性试飞

美空军AI完成F-16战斗机自主空战历史性试飞

专知会员服务

8+阅读 · 7月20日

《美政府问责局——武器系统年度评估（2026年）：强制要求成熟技术或可推动转向快速交付》249页

《美政府问责局——武器系统年度评估（2026年）：强制要求成熟技术或可推动转向快速交付》249页

专知会员服务

10+阅读 · 7月20日

相关VIP内容

百篇论文纵览大型语言模型最新研究进展

百篇论文纵览大型语言模型最新研究进展

专知会员服务

70+阅读 · 2023年3月31日

【干货书】深度学习合成数据，354页pdf，Synthetic Data for Deep Learning

【干货书】深度学习合成数据，354页pdf，Synthetic Data for Deep Learning

专知会员服务

105+阅读 · 2022年2月10日

【DeepMind】强化学习教程，83页ppt

【DeepMind】强化学习教程，83页ppt

专知会员服务

160+阅读 · 2020年8月7日

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

专知会员服务

167+阅读 · 2020年3月18日

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

60+阅读 · 2019年10月17日

强化学习最新教程，17页pdf

强化学习最新教程，17页pdf

专知会员服务

182+阅读 · 2019年10月11日

[综述]深度学习下的场景文本检测与识别

[综述]深度学习下的场景文本检测与识别

专知会员服务

78+阅读 · 2019年10月10日

机器学习入门的经验与建议

机器学习入门的经验与建议

专知会员服务

94+阅读 · 2019年10月10日

【加州大学伯克利分校博士论文】通过自我监督预测学习泛化

【加州大学伯克利分校博士论文】通过自我监督预测学习泛化

专知会员服务

65+阅读 · 2019年10月9日

【SIGGRAPH2019】TensorFlow 2.0深度学习计算机图形学应用

【SIGGRAPH2019】TensorFlow 2.0深度学习计算机图形学应用

专知会员服务

41+阅读 · 2019年10月9日

热门VIP内容

开通专知VIP会员享更多权益服务

伊朗不对称防空战略的演进

《面向复杂地形下无人机跟踪地面机器人（UAV–UGV）的自适应多滤波器扩展卡尔曼滤波框架》

“天降毒雾”：无人机如何使化学战重返乌克兰战场

对抗环境下超视距目标打击的情报支援

相关资讯

强化学习三篇论文避免遗忘等

强化学习三篇论文避免遗忘等

CreateAMind

20+阅读 · 2019年5月24日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

逆强化学习-学习人先验的动机

逆强化学习-学习人先验的动机

CreateAMind

16+阅读 · 2019年1月18日

强化学习的Unsupervised Meta-Learning

强化学习的Unsupervised Meta-Learning

CreateAMind

18+阅读 · 2019年1月7日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

44+阅读 · 2019年1月3日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

vae 相关论文表示学习 1

vae 相关论文表示学习 1

CreateAMind

12+阅读 · 2018年9月6日

【代码资源】GAN | 七份最热GAN文章及代码分享（Github 1000+Stars）

【代码资源】GAN | 七份最热GAN文章及代码分享（Github 1000+Stars）

专知

13+阅读 · 2018年6月24日

强化学习族谱

强化学习族谱

CreateAMind

26+阅读 · 2017年8月2日

相关论文

TIM: Teaching Large Language Models to Translate with Comparison

Arxiv

0+阅读 · 2023年7月10日

Large Language Models for Supply Chain Optimization

Arxiv

0+阅读 · 2023年7月8日

Evaluating Open-Domain Question Answering in the Era of Large Language Models

Arxiv

0+阅读 · 2023年7月6日

A Survey on Evaluation of Large Language Models

Arxiv

0+阅读 · 2023年7月6日

Style Over Substance: Evaluation Biases for Large Language Models

Arxiv

0+阅读 · 2023年7月6日

A Survey on Multimodal Large Language Models

Arxiv

25+阅读 · 2023年6月23日

A Survey on Large Language Models for Recommendation

Arxiv

12+阅读 · 2023年5月31日

Augmented Large Language Models with Parametric Knowledge Guiding

Arxiv

20+阅读 · 2023年5月8日

Conditional Prompt Learning for Vision-Language Models

Conditional Prompt Learning for Vision-Language Models

Arxiv

13+阅读 · 2022年3月10日

Learning Neural Models for Natural Language Processing in the Face of Distributional Shift

Arxiv

11+阅读 · 2021年9月3日

相关基金

Volterra积分微分方程的多区间Chebyshev和Legendre谱配置法

国家自然科学基金

0+阅读 · 2015年12月31日

拟人化hERG基因突变及敲除的糖尿病动物模型的建立

国家自然科学基金

0+阅读 · 2014年12月31日

多元多项式环的Hermite性质与多项式矩阵的分解

国家自然科学基金

0+阅读 · 2014年12月31日

刚性微分方程高阶隐式离散解的快速迭代算法

国家自然科学基金

0+阅读 · 2013年12月31日

磁化氢气脉冲放电的PIC/MC/DSMC模拟研究

国家自然科学基金

0+阅读 · 2012年12月31日

非线性椭圆型偏微分方程的边界正则性

国家自然科学基金

0+阅读 · 2012年12月31日

BdDUOX和BdRelish在橘小实蝇肠道微生物群落稳态维持中的作用机理

国家自然科学基金

0+阅读 · 2012年12月31日

家蚕贮藏蛋白Arylphorin结构解析及分子降解机制研究

国家自然科学基金

0+阅读 · 2012年12月31日

自然环境与目标电磁散射和极化SAR遥感信息理论

国家自然科学基金

0+阅读 · 2009年12月31日

Blazars和微类星体的多波段观测与研究

国家自然科学基金

0+阅读 · 2008年12月31日

微信扫码咨询专知VIP会员