AugTriever: Unsupervised Dense Retrieval by Scalable Data Augmentation

March 7, 2023


Rui Meng,Ye Liu,Semih Yavuz,Divyansh Agarwal,Lifu Tu,Ning Yu,Jianguo Zhang,Meghana Bhat,Yingbo Zhou

Dense retrievers have made significant strides in text retrieval and open-domain question answering, though most of these achievements depended on large amounts of human supervision. In this work, we propose two unsupervised methods that create pseudo query-document pairs and train dense retrieval models in an annotation-free and scalable manner: query extraction and transferred query generation. The former produces pseudo queries by selecting salient spans from the original document. The latter uses generation models trained for other NLP tasks (e.g., summarization) to produce pseudo queries. Extensive experiments show that models trained with the proposed augmentation methods perform comparably to, or better than, multiple strong baselines. Combining these strategies leads to further improvements, achieving state-of-the-art performance for unsupervised dense retrieval on both BEIR and ODQA datasets.
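To make the query extraction idea concrete, here is a minimal sketch of how pseudo query-document pairs might be built from raw documents alone. The abstract does not say how salient spans are chosen, so the term-frequency scoring below is purely an illustrative stand-in, and the names `split_sentences`, `extract_pseudo_query`, and `build_pairs` are hypothetical rather than taken from the paper.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extract_pseudo_query(document):
    """Return one sentence of the document to serve as a pseudo query.

    Assumption: salience is approximated by the average document-level
    term frequency of a sentence's words. This is a stand-in heuristic,
    not the selection strategy used by the authors.
    """
    sentences = split_sentences(document)
    if not sentences:
        return ""
    freq = Counter(re.findall(r"[a-z0-9]+", document.lower()))

    def score(sentence):
        words = re.findall(r"[a-z0-9]+", sentence.lower())
        if not words:
            return 0.0
        return sum(freq[w] for w in words) / len(words)

    return max(sentences, key=score)


def build_pairs(documents):
    """Create (pseudo_query, document) training pairs without any annotation."""
    pairs = []
    for doc in documents:
        query = extract_pseudo_query(doc)
        if query:
            pairs.append((query, doc))
    return pairs
```

Pairs produced this way could then be fed to whatever training objective the retriever uses. The transferred query generation variant would, under the same assumptions, swap `extract_pseudo_query` for a call to a model trained on another task such as summarization.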

