Creating labeled natural language training data is expensive and requires significant human effort. We mine input-output examples from large corpora using a supervised mining function trained on a small seed set of only 100 examples. Mining proceeds in two stages: (1) a bi-encoder-based, recall-oriented dense search that pairs inputs with potential outputs, and (2) a cross-encoder-based filter that re-ranks the bi-encoder's candidates for higher precision. Unlike model-generated data augmentation, our method mines naturally occurring, high-quality input-output pairs that mimic the style of the seed set across multiple tasks. On SQuAD-style reading comprehension, augmenting the seed set with the mined data yields an improvement of 13 F1 over a BART-large baseline fine-tuned only on the seed set. Likewise, we see an improvement of 1.46 ROUGE-L on XSum abstractive summarization.
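The two-stage mine-and-filter pipeline can be sketched roughly as follows. This is a minimal illustration only: the paper trains its bi-encoder and cross-encoder on the 100-example seed set, whereas the sketch below assumes off-the-shelf sentence-transformers checkpoints, a toy candidate pool, and an arbitrary score threshold.

```python
# Minimal sketch of the two-stage mining pipeline (illustrative, not the paper's code).
from sentence_transformers import SentenceTransformer, CrossEncoder, util

# Hypothetical candidate pools mined from a corpus: inputs and potential outputs.
candidate_inputs = ["Who wrote the novel?", "When did the war end?"]
candidate_outputs = [
    "The novel was written by Jane Austen in 1813.",
    "The war ended in 1945 after the armistice.",
]

# Stage 1: recall-oriented dense search with a bi-encoder.
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint
input_emb = bi_encoder.encode(candidate_inputs, convert_to_tensor=True)
output_emb = bi_encoder.encode(candidate_outputs, convert_to_tensor=True)
hits = util.semantic_search(input_emb, output_emb, top_k=5)  # top-k outputs per input

# Stage 2: precision-oriented filtering with a cross-encoder re-ranker.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # assumed checkpoint
mined_pairs = []
for i, per_input_hits in enumerate(hits):
    pairs = [(candidate_inputs[i], candidate_outputs[h["corpus_id"]]) for h in per_input_hits]
    scores = cross_encoder.predict(pairs)
    for pair, score in zip(pairs, scores):
        if score > 0.5:  # illustrative threshold, not taken from the paper
            mined_pairs.append(pair)

print(mined_pairs)  # mined input-output pairs to add to the seed set before fine-tuning
```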