We present QueryNER, a manually annotated dataset and accompanying model for e-commerce query segmentation. Prior work in sequence labeling for e-commerce has largely addressed aspect-value extraction, which focuses on extracting portions of a product title or query for narrowly defined aspects. Our work instead focuses on dividing a query into meaningful chunks with broadly applicable types. We report baseline tagging results and conduct experiments comparing token dropping and entity dropping for recovering null and low-recall queries. We create challenging test sets using automatic transformations and show how simple data augmentation techniques can make the models more robust to noise. We make the QueryNER dataset publicly available.
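To make the contrast between token dropping and entity dropping concrete, the following is a minimal sketch under stated assumptions, not the implementation used in this work. It assumes a query has already been tagged with BIO labels; the example query, the label names (`brand`, `product`, `modifier`), and the helper functions `bio_to_spans`, `token_drops`, and `entity_drops` are all hypothetical and chosen only for illustration. Token dropping removes one token at a time, while entity dropping removes one whole tagged span at a time, so the relaxed query never contains a fragment of a chunk.

```python
from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Group BIO tags into (start, end_exclusive, type) spans."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        # A tag continues the open span only if it is I-<same type>.
        inside = tag.startswith("I-") and label == tag[2:]
        if not inside:
            if label is not None:
                spans.append((start, i, label))
            if tag == "O":
                start, label = None, None
            else:
                start, label = i, tag[2:]
    if label is not None:
        spans.append((start, len(tags), label))
    return spans


def token_drops(tokens: List[str]) -> List[str]:
    """Candidate rewrites obtained by removing one token at a time."""
    return [" ".join(tokens[:i] + tokens[i + 1:]) for i in range(len(tokens))]


def entity_drops(tokens: List[str], tags: List[str]) -> List[str]:
    """Candidate rewrites obtained by removing one tagged span at a time."""
    return [
        " ".join(tokens[:start] + tokens[end:])
        for start, end, _ in bio_to_spans(tags)
    ]


# Hypothetical tagged query; the labels are illustrative only.
tokens = ["nike", "running", "shoes", "size", "10"]
tags = ["B-brand", "B-product", "I-product", "B-modifier", "I-modifier"]

token_candidates = token_drops(tokens)
entity_candidates = entity_drops(tokens, tags)
```

In this toy example, token dropping yields five candidates, including fragments such as "nike running size 10" that split the product chunk, whereas entity dropping yields three candidates that each omit one complete chunk.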