Valid Inference with Synthetic Data via Task Exchangeability - 专知论文

会员服务 ·

0

推断 · 可交换的 · AI · 样例 · MoDELS ·

Valid Inference with Synthetic Data via Task Exchangeability

翻译：暂无翻译

Lezhi Tan,Tijana Zrnic

There is a proliferation of work arguing for the use of synthetic data in scientific research. For example, social scientists are arguing for the use of LLM-generated "silicon samples" in pilot studies; AI evaluations increasingly rely on "LLM-as-a-judge" outputs; and proteomics research is accelerated by generative models that produce synthetic protein structures. These developments raise an intriguing possibility: synthetic data may help researchers ask more questions, run more studies, and accelerate discovery. But they also raise a fundamental concern: synthetic data can be biased, noisy, and misspecified. In this work, we propose statistical principles for using synthetic data in scientific research with provable validity guarantees. The key insight is a new technical condition that we call task exchangeability. Informally, this is a requirement that the researcher can identify historical tasks, for which real data is available, such that their current task of interest is exchangeable with the historical tasks in an appropriate mathematical sense. We develop methods for valid inference under task exchangeability, together with extensions that provide guarantees even beyond exchangeability. We demonstrate the framework on public opinion surveys with silicon samples and AI evaluation with autoraters.

翻译：暂无翻译

0

相关内容

ICLR 2026 | DataMind: 构建通用数据分析智能体

ICLR 2026 | DataMind: 构建通用数据分析智能体

专知会员服务

15+阅读 · 3月29日

【牛津大学博士论文】关系数据的学习和推理，243页pdf

【牛津大学博士论文】关系数据的学习和推理，243页pdf

专知会员服务

54+阅读 · 2022年11月16日

复杂的序列数据分析：现有算法的系统文献综述，Complex Sequential Data Analysis: A Systematic Literature Review of Existing Algorithms

复杂的序列数据分析：现有算法的系统文献综述，Complex Sequential Data Analysis: A Systematic Literature Review of Existing Algorithms

专知会员服务

27+阅读 · 2020年7月24日

【论文推荐】层次知识图谱，Hierarchical Knowledge Graphs: A Novel Information Representation for Exploratory Search Tasks

【论文推荐】层次知识图谱，Hierarchical Knowledge Graphs: A Novel Information Representation for Exploratory Search Tasks

专知会员服务

49+阅读 · 2020年5月26日

【CHI2020-微软】解释可解释性:理解数据科学家使用机器学习的可解释性工具，Interpreting Interpretability: Understanding Data Scientists’Use of Interpretability Tools for Machine Learning

【CHI2020-微软】解释可解释性:理解数据科学家使用机器学习的可解释性工具，Interpreting Interpretability: Understanding Data Scientists’Use of Interpretability Tools for Machine Learning

专知会员服务

55+阅读 · 2020年3月8日

【伯克利 | 情感计算】大规模异构多媒体数据的情感计算:综述论文，37页pdf，171篇参考文献，Affective Computing for Large-Scale Heterogeneous Multimedia Data: A Survey

【伯克利 | 情感计算】大规模异构多媒体数据的情感计算:综述论文，37页pdf，171篇参考文献，Affective Computing for Large-Scale Heterogeneous Multimedia Data: A Survey

专知会员服务

32+阅读 · 2019年11月15日

【CIKM2019 Tutorial】Synergy of Database Techniques and Machine Learning Models for String Similarity Search and Join(字符串相似性搜索与连接：数据库技术与机器学习模型的协同)，附论文免费下载

【CIKM2019 Tutorial】Synergy of Database Techniques and Machine Learning Models for String Similarity Search and Join(字符串相似性搜索与连接：数据库技术与机器学习模型的协同)，附论文免费下载

专知会员服务

10+阅读 · 2019年11月3日

From Data to Model Programming: Injecting Structured Priors for Knowledge Extraction，南加州大学计算机科学系任翔助理教授，CIPS ATT 16（2019）

From Data to Model Programming: Injecting Structured Priors for Knowledge Extraction，南加州大学计算机科学系任翔助理教授，CIPS ATT 16（2019）

专知会员服务

14+阅读 · 2019年10月25日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

34+阅读 · 2019年10月18日

Stabilizing Transformers for Reinforcement Learning

Stabilizing Transformers for Reinforcement Learning

专知会员服务

60+阅读 · 2019年10月17日

【干货书】深度学习合成数据，354页pdf，Synthetic Data for Deep Learning

【干货书】深度学习合成数据，354页pdf，Synthetic Data for Deep Learning

专知

10+阅读 · 2022年2月10日

基于深度元学习的因果推断新方法

基于深度元学习的因果推断新方法

图与推荐

12+阅读 · 2020年7月21日

【论文】Awesome Relation Extraction Paper（关系抽取）（PART V）

【论文】Awesome Relation Extraction Paper（关系抽取）（PART V）

AINLP

38+阅读 · 2019年9月3日

【论文】Awesome Relation Extraction Paper（关系抽取）（PART IV）

【论文】Awesome Relation Extraction Paper（关系抽取）（PART IV）

AINLP

15+阅读 · 2019年8月26日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

为新研究准备好一块用武之地：最全任务型对话数据调研

为新研究准备好一块用武之地：最全任务型对话数据调研

PaperWeekly

12+阅读 · 2019年2月11日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

44+阅读 · 2019年1月3日

disentangled-representation-papers

disentangled-representation-papers

CreateAMind

26+阅读 · 2018年9月12日

香港中大-商汤科技联合实验室AAAI录用论文详解：ST-GCN时空图卷积网络模型

香港中大-商汤科技联合实验室AAAI录用论文详解：ST-GCN时空图卷积网络模型

商汤科技

12+阅读 · 2018年2月11日

IBM新论文|SamplePairing：针对图像处理领域的高效数据增强方式

IBM新论文|SamplePairing：针对图像处理领域的高效数据增强方式

极市平台

16+阅读 · 2018年1月20日

超大规模约束优化问题算法及其应用天元数学交流项目

国家自然科学基金

2+阅读 · 2017年12月31日

天元数学交流项目图像处理中的数学理论及方法研讨会

国家自然科学基金

9+阅读 · 2017年12月31日

面向海量高维数据的可深度结合的贝叶斯网学习与推理新方法研究

国家自然科学基金

6+阅读 · 2015年12月31日

函数数据变换模型及降维方法的研究

国家自然科学基金

1+阅读 · 2015年12月31日

基于复杂数据的回归模型统计推断及其应用

国家自然科学基金

3+阅读 · 2015年12月31日

数据分析算法的融合与人才培养

国家自然科学基金

7+阅读 · 2015年12月31日

超高维数据中若干检验问题的研究

国家自然科学基金

0+阅读 · 2015年12月31日

大规模轨迹数据的地理空间关联解译及分析挖掘研究

国家自然科学基金

1+阅读 · 2014年12月31日

基于复杂网络的商务大数据聚类与关联应用研究

国家自然科学基金

2+阅读 · 2014年12月31日

行为轨迹数据高性能时空聚类及社会分析

国家自然科学基金

2+阅读 · 2014年12月31日

Critical Percolation as a Synthetic Data Model for Interpretability

Arxiv

0+阅读 · 6月18日

Training-Free Metrics for Synthetic Object Detection Data: A Proxy for Detector Performance

Arxiv

0+阅读 · 6月18日

Geometry-Aware Dataset Condensation for Diffusion Model Training

Arxiv

0+阅读 · 6月17日

Towards Anomaly Detection on Relational Data

Arxiv

0+阅读 · 6月17日

Want Better Synthetic Data? Steer It: Activation Steering for Low-Resource Language Generation

Arxiv

0+阅读 · 6月16日

Data Journalist Agent: Transforming Data into Verifiable Multimodal Stories

Arxiv

0+阅读 · 6月9日

Synthetic Data Generation With Incomplete Survey Data Under Informative Sampling

Arxiv

0+阅读 · 5月29日

A Generative Adversarial Graph Neural Network for Synthetic Time Series Data

Arxiv

0+阅读 · 5月21日

Data Fusion: Theory, Methods, and Applications

Arxiv

95+阅读 · 2022年8月2日

A Survey of Learning Causality with Data: Problems and Methods

A Survey of Learning Causality with Data: Problems and Methods

Arxiv

19+阅读 · 2018年9月25日

VIP会员

文章信息

相关主题

最新内容

ICML 2026 | 多任务贝叶斯上下文学习：让 Transformer 在测试时显式适应新先验

ICML 2026 | 多任务贝叶斯上下文学习：让 Transformer 在测试时显式适应新先验

专知会员服务

3+阅读 · 6月19日

ACL 2026综述 | 大规模手语数据集：资源、基准与标注标准

ACL 2026综述 | 大规模手语数据集：资源、基准与标注标准

专知会员服务

5+阅读 · 6月19日

ICML 2026 Spotlight | SmoothSMoE：解析稀疏 MoE 路由不连续

ICML 2026 Spotlight | SmoothSMoE：解析稀疏 MoE 路由不连续

专知会员服务

6+阅读 · 6月18日

综述 | 周期表视角下的大模型推理：范式、方法与失败模式

综述 | 周期表视角下的大模型推理：范式、方法与失败模式

专知会员服务

7+阅读 · 6月18日

《廉价自杀式无人机战争的军事战略影响：乌克兰和伊朗案例研究》

《廉价自杀式无人机战争的军事战略影响：乌克兰和伊朗案例研究》

专知会员服务

11+阅读 · 6月18日

《面向反无人机作战的联邦式可解释射频–光电/红外情报融合：边缘人工智能优化、电子战韧性及分布式监视验证》

《面向反无人机作战的联邦式可解释射频–光电/红外情报融合：边缘人工智能优化、电子战韧性及分布式监视验证》

专知会员服务

10+阅读 · 6月18日

ICML 2026 | FR3D：解耦自车运动的未来动态三维重建世界模型

ICML 2026 | FR3D：解耦自车运动的未来动态三维重建世界模型

专知会员服务

7+阅读 · 6月17日

【伯克利博士论文】迈向可扩展与自我演进的大语言模型智能体

【伯克利博士论文】迈向可扩展与自我演进的大语言模型智能体

专知会员服务

11+阅读 · 6月17日

学习数据的几何：形状空间分析数学综述

学习数据的几何：形状空间分析数学综述

专知会员服务

7+阅读 · 6月17日

《现代防空系统综述：架构、传感器、拦截器及新兴威胁环境对基础设施受限防御环境的影响》2026最新长综述

《现代防空系统综述：架构、传感器、拦截器及新兴威胁环境对基础设施受限防御环境的影响》2026最新长综述

专知会员服务

15+阅读 · 6月17日

定向能反无人机系统最新发展动态

定向能反无人机系统最新发展动态

专知会员服务

8+阅读 · 6月17日

从燃煤战舰到算法战争：水面指挥的永恒要求

从燃煤战舰到算法战争：水面指挥的永恒要求

专知会员服务

6+阅读 · 6月17日

《短程弹道再入飞行器拦截时间中的一项异常现象》

《短程弹道再入飞行器拦截时间中的一项异常现象》

专知会员服务

8+阅读 · 6月17日

《基于回归方法与任务上下文的对抗环境动态战术网络报文优先级排序》

《基于回归方法与任务上下文的对抗环境动态战术网络报文优先级排序》

专知会员服务

8+阅读 · 6月17日

美智库《战术级指挥控制的迫切要求：构建弹性机动式指挥控制网络》报告

美智库《战术级指挥控制的迫切要求：构建弹性机动式指挥控制网络》报告

专知会员服务

10+阅读 · 6月17日

相关VIP内容

ICLR 2026 | DataMind: 构建通用数据分析智能体

ICLR 2026 | DataMind: 构建通用数据分析智能体

专知会员服务

15+阅读 · 3月29日

【牛津大学博士论文】关系数据的学习和推理，243页pdf

【牛津大学博士论文】关系数据的学习和推理，243页pdf

专知会员服务

54+阅读 · 2022年11月16日

复杂的序列数据分析：现有算法的系统文献综述，Complex Sequential Data Analysis: A Systematic Literature Review of Existing Algorithms

复杂的序列数据分析：现有算法的系统文献综述，Complex Sequential Data Analysis: A Systematic Literature Review of Existing Algorithms

专知会员服务

27+阅读 · 2020年7月24日

【论文推荐】层次知识图谱，Hierarchical Knowledge Graphs: A Novel Information Representation for Exploratory Search Tasks

【论文推荐】层次知识图谱，Hierarchical Knowledge Graphs: A Novel Information Representation for Exploratory Search Tasks

专知会员服务

49+阅读 · 2020年5月26日

【CHI2020-微软】解释可解释性:理解数据科学家使用机器学习的可解释性工具，Interpreting Interpretability: Understanding Data Scientists’Use of Interpretability Tools for Machine Learning

【CHI2020-微软】解释可解释性:理解数据科学家使用机器学习的可解释性工具，Interpreting Interpretability: Understanding Data Scientists’Use of Interpretability Tools for Machine Learning

专知会员服务

55+阅读 · 2020年3月8日

【伯克利 | 情感计算】大规模异构多媒体数据的情感计算:综述论文，37页pdf，171篇参考文献，Affective Computing for Large-Scale Heterogeneous Multimedia Data: A Survey

【伯克利 | 情感计算】大规模异构多媒体数据的情感计算:综述论文，37页pdf，171篇参考文献，Affective Computing for Large-Scale Heterogeneous Multimedia Data: A Survey

专知会员服务

32+阅读 · 2019年11月15日

【CIKM2019 Tutorial】Synergy of Database Techniques and Machine Learning Models for String Similarity Search and Join(字符串相似性搜索与连接：数据库技术与机器学习模型的协同)，附论文免费下载

【CIKM2019 Tutorial】Synergy of Database Techniques and Machine Learning Models for String Similarity Search and Join(字符串相似性搜索与连接：数据库技术与机器学习模型的协同)，附论文免费下载

专知会员服务

10+阅读 · 2019年11月3日

From Data to Model Programming: Injecting Structured Priors for Knowledge Extraction，南加州大学计算机科学系任翔助理教授，CIPS ATT 16（2019）

From Data to Model Programming: Injecting Structured Priors for Knowledge Extraction，南加州大学计算机科学系任翔助理教授，CIPS ATT 16（2019）

专知会员服务

14+阅读 · 2019年10月25日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

34+阅读 · 2019年10月18日

Stabilizing Transformers for Reinforcement Learning

Stabilizing Transformers for Reinforcement Learning

专知会员服务

60+阅读 · 2019年10月17日

热门VIP内容

开通专知VIP会员享更多权益服务

ACL 2026综述 | 大规模手语数据集：资源、基准与标注标准

综述 | 周期表视角下的大模型推理：范式、方法与失败模式

ICML 2026 | 多任务贝叶斯上下文学习：让 Transformer 在测试时显式适应新先验

ICML 2026 Spotlight | SmoothSMoE：解析稀疏 MoE 路由不连续

相关资讯

【干货书】深度学习合成数据，354页pdf，Synthetic Data for Deep Learning

【干货书】深度学习合成数据，354页pdf，Synthetic Data for Deep Learning

专知

10+阅读 · 2022年2月10日

基于深度元学习的因果推断新方法

基于深度元学习的因果推断新方法

图与推荐

12+阅读 · 2020年7月21日

【论文】Awesome Relation Extraction Paper（关系抽取）（PART V）

【论文】Awesome Relation Extraction Paper（关系抽取）（PART V）

AINLP

38+阅读 · 2019年9月3日

【论文】Awesome Relation Extraction Paper（关系抽取）（PART IV）

【论文】Awesome Relation Extraction Paper（关系抽取）（PART IV）

AINLP

15+阅读 · 2019年8月26日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

为新研究准备好一块用武之地：最全任务型对话数据调研

为新研究准备好一块用武之地：最全任务型对话数据调研

PaperWeekly

12+阅读 · 2019年2月11日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

44+阅读 · 2019年1月3日

disentangled-representation-papers

disentangled-representation-papers

CreateAMind

26+阅读 · 2018年9月12日

香港中大-商汤科技联合实验室AAAI录用论文详解：ST-GCN时空图卷积网络模型

香港中大-商汤科技联合实验室AAAI录用论文详解：ST-GCN时空图卷积网络模型

商汤科技

12+阅读 · 2018年2月11日

IBM新论文|SamplePairing：针对图像处理领域的高效数据增强方式

IBM新论文|SamplePairing：针对图像处理领域的高效数据增强方式

极市平台

16+阅读 · 2018年1月20日

相关论文

Critical Percolation as a Synthetic Data Model for Interpretability

Arxiv

0+阅读 · 6月18日

Training-Free Metrics for Synthetic Object Detection Data: A Proxy for Detector Performance

Arxiv

0+阅读 · 6月18日

Geometry-Aware Dataset Condensation for Diffusion Model Training

Arxiv

0+阅读 · 6月17日

Towards Anomaly Detection on Relational Data

Arxiv

0+阅读 · 6月17日

Want Better Synthetic Data? Steer It: Activation Steering for Low-Resource Language Generation

Arxiv

0+阅读 · 6月16日

Data Journalist Agent: Transforming Data into Verifiable Multimodal Stories

Arxiv

0+阅读 · 6月9日

Synthetic Data Generation With Incomplete Survey Data Under Informative Sampling

Arxiv

0+阅读 · 5月29日

A Generative Adversarial Graph Neural Network for Synthetic Time Series Data

Arxiv

0+阅读 · 5月21日

Data Fusion: Theory, Methods, and Applications

Arxiv

95+阅读 · 2022年8月2日

A Survey of Learning Causality with Data: Problems and Methods

A Survey of Learning Causality with Data: Problems and Methods

Arxiv

19+阅读 · 2018年9月25日

相关基金

超大规模约束优化问题算法及其应用天元数学交流项目

国家自然科学基金

2+阅读 · 2017年12月31日

天元数学交流项目图像处理中的数学理论及方法研讨会

国家自然科学基金

9+阅读 · 2017年12月31日

面向海量高维数据的可深度结合的贝叶斯网学习与推理新方法研究

国家自然科学基金

6+阅读 · 2015年12月31日

函数数据变换模型及降维方法的研究

国家自然科学基金

1+阅读 · 2015年12月31日

基于复杂数据的回归模型统计推断及其应用

国家自然科学基金

3+阅读 · 2015年12月31日

数据分析算法的融合与人才培养

国家自然科学基金

7+阅读 · 2015年12月31日

超高维数据中若干检验问题的研究

国家自然科学基金

0+阅读 · 2015年12月31日

大规模轨迹数据的地理空间关联解译及分析挖掘研究

国家自然科学基金

1+阅读 · 2014年12月31日

基于复杂网络的商务大数据聚类与关联应用研究

国家自然科学基金

2+阅读 · 2014年12月31日

行为轨迹数据高性能时空聚类及社会分析

国家自然科学基金

2+阅读 · 2014年12月31日

微信扫码咨询专知VIP会员