Prior work on question answering (QA) over privacy policies frames the task as identifying the most relevant text segment, or a list of sentences, from a policy document given a user query. Existing labeled datasets are heavily imbalanced, with only a few relevant segments, which limits QA performance in this domain. In this paper, we develop a data augmentation framework based on an ensemble of retriever models that captures relevant text segments from unlabeled policy documents and expands the positive examples in the training set. In addition, to improve the diversity and quality of the augmented data, we leverage multiple pre-trained language models (LMs) and cascade them with noise-reduction filter models. Using our augmented data on the PrivacyQA benchmark, we improve on the existing baseline by a large margin (10\% F1) and achieve a new state-of-the-art F1 score of 50\%. Our ablation studies provide further insights into the effectiveness of our approach.
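The retrieve-then-filter idea described above can be illustrated with a minimal sketch. It is not the paper's implementation: the two toy lexical scorers (`word_overlap`, `prefix_jaccard`) merely stand in for retrievers built on pre-trained LMs, the filter is a hypothetical threshold rule, and all function names, the example query, and the example segments are assumptions made for illustration.

```python
from typing import Callable, List

# Hypothetical interfaces: a retriever scores a (query, segment) pair,
# and a noise filter decides whether a retrieved pair is kept.
Retriever = Callable[[str, str], float]
NoiseFilter = Callable[[str, str], bool]


def word_overlap(query: str, segment: str) -> float:
    """Toy retriever: number of exact word matches."""
    return float(len(set(query.lower().split()) & set(segment.lower().split())))


def prefix_jaccard(query: str, segment: str) -> float:
    """Toy retriever: Jaccard similarity over 5-character word prefixes."""
    q = {w[:5] for w in query.lower().split()}
    s = {w[:5] for w in segment.lower().split()}
    return len(q & s) / len(q | s) if q | s else 0.0


def augment_positives(query: str, segments: List[str],
                      retrievers: List[Retriever],
                      noise_filter: NoiseFilter, k: int) -> List[str]:
    """Pool each retriever's top-k segments, then drop pairs the filter rejects."""
    pooled: List[str] = []
    for retriever in retrievers:
        ranked = sorted(segments, key=lambda s: retriever(query, s), reverse=True)
        for segment in ranked[:k]:
            if segment not in pooled:  # keep the union, without duplicates
                pooled.append(segment)
    return [s for s in pooled if noise_filter(query, s)]


# Made-up data for illustration only.
query = "do you share my location"
segments = [
    "we share your location with partners",
    "we use cookies to remember settings",
    "location data is shared",
    "contact us by email",
]

# Hypothetical filter: reject pairs with no lexical evidence at all.
augmented = augment_positives(
    query, segments,
    retrievers=[word_overlap, prefix_jaccard],
    noise_filter=lambda q, s: prefix_jaccard(q, s) > 0.0,
    k=3,
)
```

In this toy run, both scorers pull in the unrelated cookie segment once k exceeds the number of truly relevant segments, and the filter removes it again. That is the role the cascaded noise-reduction filters play in the framework; the segments that survive would be added to the training set as extra positive examples for the query.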