Regulatory compliance has become a cornerstone of corporate governance, ensuring adherence to systematic legal frameworks. Financial regulations typically comprise intricate provisions, layered logical structures, and numerous exceptions, which make manual compliance review labor-intensive and the rules themselves hard to interpret. To mitigate this, Regulatory Technology (RegTech) and Large Language Models (LLMs) have gained significant attention for automating the conversion of regulatory text into executable compliance logic. However, their performance remains suboptimal, particularly on Chinese-language financial regulations, due to three key limitations: (1) incomplete domain-specific knowledge representation, (2) insufficient hierarchical reasoning capabilities, and (3) failure to maintain temporal and logical coherence. One promising solution is to develop domain-specific, code-oriented datasets for model training. Existing datasets such as LexGLUE, LegalBench, and CODE-ACCORD are often English-focused, domain-mismatched, or too coarse-grained for compliance code generation. To fill these gaps, we present Compliance-to-Code, the first large-scale Chinese dataset dedicated to financial regulatory compliance. It covers 1,159 annotated clauses from 361 regulations across ten categories. Each clause is modularly structured into four logical elements (subject, condition, constraint, and contextual information), together with regulation relations. We provide deterministic Python code mappings, detailed code reasoning, and code explanations to facilitate automated auditing. To demonstrate the dataset's utility, we present FinCheck, a pipeline for regulation structuring, code generation, and report generation.
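To make the clause structure concrete, the following is a minimal sketch of how one structured clause and its deterministic Python check might look. It is not taken from the dataset: the `Clause` class, its field names, the record keys (`holding_ratio`, `sold_ratio_90d`), the relation label, and the rule itself are all hypothetical and chosen only to mirror the four logical elements and the regulation relations described above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Clause:
    """Hypothetical container for one structured clause (field names are assumptions)."""
    clause_id: str
    subject: str      # who the clause applies to
    condition: str    # when the clause is triggered
    constraint: str   # what must hold once it is triggered
    context: str = ""  # contextual information
    relations: dict = field(default_factory=dict)  # related clause id -> relation type


# An invented toy clause, not a real regulation and not a dataset entry.
TOY_CLAUSE = Clause(
    clause_id="toy-001",
    subject="shareholder holding at least 5% of shares",
    condition="sells shares within a 90-day window",
    constraint="shares sold must not exceed 1% of total shares",
    context="invented example for illustration",
    relations={"toy-002": "refer_to"},
)


def check_toy_clause(record: dict) -> Optional[bool]:
    """Return True if compliant, False if violated, None if the clause does not apply."""
    # Subject: the clause only covers large shareholders.
    if record["holding_ratio"] < 0.05:
        return None
    # Condition: the clause is triggered only if shares were sold in the window.
    if record["sold_ratio_90d"] <= 0:
        return None
    # Constraint: the sold fraction must stay within the cap.
    return record["sold_ratio_90d"] <= 0.01
```

Separating "not applicable" (`None`) from "violated" (`False`) is a design choice of this sketch; it keeps the subject and condition tests distinct from the constraint test, so an audit report could say which element of the clause decided the outcome.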