Developers often have questions about semantic aspects of the code they are working on, e.g., "Is there a class whose parent classes declare a conflicting attribute?". Answering such questions requires understanding code semantics, such as the attributes and inheritance relations of classes. An answer should identify the code spans that constitute the answer (e.g., the declaration of the subclass) as well as supporting facts (e.g., the definitions of the conflicting attributes). Existing work on question answering over code has considered yes/no questions or method-level context. We contribute a labeled dataset, called CodeQueries, of semantic queries over Python code. Compared to existing datasets, in CodeQueries the queries are about code semantics, the context is file-level, and the answers are code spans. We curate the dataset based on queries supported by a widely used static analysis tool, CodeQL, and include both positive and negative examples, as well as queries requiring single-hop and multi-hop reasoning. To assess the value of our dataset, we evaluate baseline neural approaches. We study a large language model (GPT3.5-Turbo) in zero-shot and few-shot settings on a subset of CodeQueries. We also evaluate a BERT-style model (CuBERT) with fine-tuning. We find that these models achieve limited success on CodeQueries. CodeQueries is thus a challenging dataset for testing the ability of neural models to understand code semantics in the extractive question-answering setting.
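To make the example query concrete, the following is a minimal sketch of what "a class whose parent classes declare a conflicting attribute" could look like in a Python file, together with a toy detector built on the standard `ast` module. It is not CodeQL's implementation and does not reflect the CodeQueries data format: the sample classes, the helper names `class_attributes` and `find_conflicts`, and the result keys `answer_span` and `supporting_facts` are all assumptions made for illustration. The sketch only considers direct base classes defined in the same file and simple class-level assignments.

```python
import ast

# Hypothetical file content: Car inherits `status` from two bases that both define it.
SOURCE = '''\
class Engine:
    status = "off"

class Radio:
    status = 0

class Car(Engine, Radio):
    pass

class Bike(Engine):
    pass
'''


def class_attributes(node):
    """Map each class-level assigned name to the line number of its assignment."""
    attrs = {}
    for stmt in node.body:
        if isinstance(stmt, ast.Assign):
            for target in stmt.targets:
                if isinstance(target, ast.Name):
                    attrs[target.id] = stmt.lineno
    return attrs


def find_conflicts(source):
    """Report classes whose direct, same-file bases assign the same attribute."""
    tree = ast.parse(source)
    classes = {n.name: n for n in tree.body if isinstance(n, ast.ClassDef)}
    results = []
    for cls in classes.values():
        bases = [b.id for b in cls.bases
                 if isinstance(b, ast.Name) and b.id in classes]
        seen = {}  # attribute name -> (base class name, line of definition)
        for base in bases:
            for attr, line in class_attributes(classes[base]).items():
                if attr in seen:
                    results.append({
                        "attribute": attr,
                        # Answer span: the declaration of the subclass.
                        "answer_span": (cls.name, cls.lineno),
                        # Supporting facts: the conflicting definitions.
                        "supporting_facts": [seen[attr], (base, line)],
                    })
                else:
                    seen[attr] = (base, line)
    return results


for hit in find_conflicts(SOURCE):
    print(hit)
```

In this sketch, `Car` is the answer span and the two `status` assignments are the supporting facts, while `Bike` acts as a negative example because it has a single base. Locating the subclass and then the definitions in each of its parents is the kind of multi-step lookup that the abstract refers to as multi-hop reasoning.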