Automatic Compliance Checking (ACC) in the Architecture, Engineering, and Construction (AEC) sector cannot reach its full potential unless the interpretation of building regulations is itself automated. Converting textual rules into machine-readable formats is challenging because of the complexity of natural language and the scarcity of resources for advanced Machine Learning (ML). To address these challenges, we introduce CODE-ACCORD, a dataset of 862 sentences from the building regulations of England and Finland. Only self-contained sentences, which express complete rules without needing additional context, were included, as they are essential for ACC. Each sentence was manually annotated with entities and relations by a team of 12 annotators to facilitate machine-readable rule generation, and the annotations were then carefully curated to ensure accuracy. The final dataset comprises 4,297 entities and 4,329 relations across various categories, serving as a robust ground truth. CODE-ACCORD supports a range of ML and Natural Language Processing (NLP) tasks, including text classification, entity recognition, and relation extraction. It enables recent techniques, such as deep neural networks and large language models, to be applied to ACC.