Semantic text similarity plays an important role in software engineering tasks in which engineers are asked to clarify the semantics of descriptive labels (e.g., business terms, table column names) that appear in their IT systems and often consist of words that are too short or too generic. We formulate this type of problem as a task of matching descriptive labels to glossary descriptions. We then propose a framework that leverages an existing semantic text similarity (STS) measure and augments it with semantic label enrichment and set-based collective contextualization. The former retrieves sentences relevant to a given label; the latter computes the similarity between two contexts, each derived from a set of texts (e.g., column names in the same table). We conducted an experiment on two datasets derived from publicly available data sources. The results indicate that the proposed methods helped the underlying STS correctly match more descriptive labels with their descriptions.
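To make the shape of such a framework concrete, the following is a minimal sketch and not the authors' implementation. It assumes a simple token-overlap (Jaccard) score as a stand-in for the underlying STS, a keyword-overlap retrieval step as a stand-in for semantic label enrichment, and an average best-match score as a stand-in for set-based collective contextualization. All function names, the `related` field of a glossary entry, and the weighting parameter `alpha` are hypothetical and introduced only for illustration.

```python
import re
from typing import Callable, Dict, List, Set

Sts = Callable[[str, str], float]


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard_sts(a: str, b: str) -> float:
    """Stand-in STS: Jaccard overlap of token sets (an assumption, not the paper's STS)."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def enrich_label(label: str, corpus: List[str], top_k: int = 2) -> str:
    """Label enrichment sketch: append up to top_k corpus sentences sharing tokens with the label."""
    label_tokens = tokenize(label)
    scored = [(len(label_tokens & tokenize(s)), s) for s in corpus]
    ranked = sorted(scored, key=lambda pair: -pair[0])
    relevant = [s for overlap, s in ranked if overlap > 0][:top_k]
    return " ".join([label] + relevant)


def context_similarity(set_a: List[str], set_b: List[str], sts: Sts) -> float:
    """Set-based context sketch: mean, over set_a, of the best STS score against set_b."""
    if not set_a or not set_b:
        return 0.0
    return sum(max(sts(a, b) for b in set_b) for a in set_a) / len(set_a)


def match_label(
    label: str,
    siblings: List[str],
    glossary: Dict[str, Dict[str, object]],
    corpus: List[str],
    sts: Sts = jaccard_sts,
    alpha: float = 0.5,
) -> str:
    """Return the glossary term whose description and context best match the label."""
    enriched = enrich_label(label, corpus)

    def score(term: str) -> float:
        entry = glossary[term]
        direct = sts(enriched, str(entry["description"]))
        contextual = context_similarity(siblings, list(entry["related"]), sts)
        # alpha is a hypothetical weight balancing the two signals.
        return alpha * direct + (1 - alpha) * contextual

    return max(glossary, key=score)


# Illustrative, made-up data (not from the paper's datasets).
corpus = [
    "A cust id uniquely identifies a customer account.",
    "The order amt is the total amount of an order.",
]
glossary = {
    "Customer Identifier": {
        "description": "Unique identifier of a customer account",
        "related": ["customer name", "customer address"],
    },
    "Order Amount": {
        "description": "Total amount charged for an order",
        "related": ["order date", "order status"],
    },
}
print(match_label("cust id", ["cust name", "cust address"], glossary, corpus))
```

In this sketch the label `cust id` on its own shares no tokens with either description; the retrieved sentence and the sibling column names supply the overlap that lets the stand-in STS separate the two candidates. A real system would replace `jaccard_sts` with a stronger STS model while keeping the same two augmentation steps around it.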