Automatic query reformulation is a widely used technique for enriching user requirements and improving code search results. It can be framed as a machine translation task in which the objective is to rephrase a given query into a more comprehensive alternative. While such models show promising results, training them typically requires a large parallel corpus of query pairs (i.e., an original query and its reformulation), which online code search engines keep confidential and do not publish. This limits their practicality in software development. In this paper, we propose SSQR, a self-supervised query reformulation method that does not rely on any parallel query corpus. Inspired by pre-trained models, SSQR treats query reformulation as a masked language modeling task conducted on a large unannotated corpus of queries. SSQR extends T5 (a sequence-to-sequence model based on Transformer) with a new pre-training objective named corrupted query completion (CQC), which randomly masks words within a complete query and trains T5 to predict the masked content. Then, for a given query to be reformulated, SSQR identifies potential positions for expansion and uses the pre-trained T5 model to generate content that fills these gaps. Expansions are selected according to the information gain of each candidate. Evaluation results demonstrate that SSQR significantly outperforms unsupervised baselines and achieves performance competitive with supervised methods.
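The two data-side steps described above, corrupting a complete query for CQC pre-training and marking candidate expansion positions at inference time, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the single-span masking policy, the `max_span` parameter, and the use of T5's `<extra_id_0>`/`<extra_id_1>` sentinel convention for the target are all assumptions made for illustration; the abstract does not specify these details, and the information-gain selection step is omitted because its definition is not given here.

```python
import random

# Sentinel tokens in the style of T5's span-corruption vocabulary (assumed here).
SENTINEL = "<extra_id_0>"
END_SENTINEL = "<extra_id_1>"


def corrupt_query(query: str, rng: random.Random, max_span: int = 2) -> tuple[str, str]:
    """Mask one random contiguous span of words in a complete query.

    Returns (corrupted_input, target), i.e. one self-supervised training pair.
    """
    words = query.split()
    if not words:
        raise ValueError("query must contain at least one word")
    span = rng.randint(1, min(max_span, len(words)))
    start = rng.randint(0, len(words) - span)
    masked = words[start:start + span]
    corrupted = words[:start] + [SENTINEL] + words[start + span:]
    target = f"{SENTINEL} {' '.join(masked)} {END_SENTINEL}"
    return " ".join(corrupted), target


def expansion_candidates(query: str) -> list[str]:
    """Place the sentinel at every gap of the query, including both ends.

    Each result is an input a pre-trained model could be asked to complete.
    """
    words = query.split()
    return [
        " ".join(words[:i] + [SENTINEL] + words[i:])
        for i in range(len(words) + 1)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    print(corrupt_query("convert string to int", rng))
    for candidate in expansion_candidates("read file"):
        print(candidate)
```

In this sketch, training pairs come only from unannotated queries, which is what removes the need for a parallel corpus. A query of n words yields n + 1 candidate expansion positions, each of which would then be filled by the model and scored.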