This paper presents the Structure Aware Dense Retrieval (SANTA) model, which encodes user queries and structured data in one universal embedding space for retrieving structured data. SANTA proposes two pretraining methods that make language models structure-aware and help them learn effective representations of structured data: 1) Structured Data Alignment, which uses the natural alignment relations between structured and unstructured data for structure-aware pretraining. It contrastively trains language models to represent multi-modal text data and teaches them to identify the structured data that matches a given unstructured text. 2) Masked Entity Prediction, which designs an entity-oriented masking strategy and asks language models to fill in the masked entities. Our experiments show that SANTA achieves state-of-the-art results on code search and product search and performs convincingly in the zero-shot setting. SANTA learns tailored representations for multi-modal text data by aligning structured and unstructured data pairs, and it captures structural semantics by masking and predicting entities in the structured data. All code is available at https://github.com/OpenMatch/OpenMatch.
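The two pretraining objectives named in the abstract can be illustrated with a small, dependency-free sketch. It is not the released SANTA implementation: the function names `in_batch_contrastive_loss` and `mask_entities`, the dot-product scoring, the in-batch negatives, the `temperature` parameter, and the T5-style `<extra_id_N>` sentinels (with one sentinel shared by all occurrences of the same entity) are all assumptions made for illustration.

```python
import math


def in_batch_contrastive_loss(text_vecs, struct_vecs, temperature=1.0):
    """Illustrative alignment loss (assumed form, not the paper's exact loss).

    text_vecs[i] and struct_vecs[i] are embeddings of a matched pair.
    Every other structured item in the batch serves as a negative.
    Returns the mean softmax cross-entropy over the batch.
    """
    total = 0.0
    for i, query in enumerate(text_vecs):
        # Dot-product relevance score against every structured item.
        scores = [
            sum(a * b for a, b in zip(query, doc)) / temperature
            for doc in struct_vecs
        ]
        # Numerically stable log-sum-exp for the softmax denominator.
        top = max(scores)
        log_z = top + math.log(sum(math.exp(s - top) for s in scores))
        # Negative log-probability of the matched item.
        total += log_z - scores[i]
    return total / len(text_vecs)


def mask_entities(tokens, entities, sentinel="<extra_id_{}>"):
    """Illustrative entity-oriented masking (assumed scheme).

    Replaces every token found in `entities` with a sentinel. All
    occurrences of the same entity share one sentinel. Returns the
    masked token list and the (sentinel, entity) pairs to predict.
    """
    sentinel_of = {}
    masked = []
    for token in tokens:
        if token in entities:
            if token not in sentinel_of:
                sentinel_of[token] = sentinel.format(len(sentinel_of))
            masked.append(sentinel_of[token])
        else:
            masked.append(token)
    targets = [(s, entity) for entity, s in sentinel_of.items()]
    return masked, targets


if __name__ == "__main__":
    # Toy embeddings: matched pairs point in the same direction.
    texts = [[1.0, 0.0], [0.0, 1.0]]
    structs = [[1.0, 0.0], [0.0, 1.0]]
    print(in_batch_contrastive_loss(texts, structs))

    # Toy code snippet where the identifiers "a" and "b" are the entities.
    code = ["def", "add", "(", "a", ",", "b", ")", ":", "return", "a", "+", "b"]
    print(mask_entities(code, {"a", "b"}))
```

In this sketch the alignment loss is smallest when each text embedding scores highest against its own structured counterpart, and the masking step hides identifiers so that a model would have to recover them from the surrounding structure.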