Semantic Risk Scoring of Aggregated Metrics: An AI-Driven Approach for Healthcare Data Governance

Large healthcare institutions typically operate multiple business intelligence (BI) teams segmented by domain, including clinical performance, fundraising, operations, and compliance. Because of HIPAA, FERPA, and IRB restrictions, these teams struggle to share the patient-level data that analytics requires. To mitigate this, a metric aggregation table is proposed: a precomputed, privacy-compliant summary that enables decision-making without direct access to sensitive data. However, even aggregated metrics can inadvertently create privacy risks if they are constructed without rigorous safeguards. A modular AI framework is therefore proposed that evaluates SQL-based metric definitions for potential overexposure using both semantic and syntactic features. Specifically, the system parses SQL queries into abstract syntax trees (ASTs), extracts sensitive patterns (e.g., fine-grained GROUP BY on ZIP code or gender), and encodes the query logic using pretrained CodeBERT embeddings. These embeddings are fused with structural features and passed to an XGBoost classifier trained to assign risk scores. Queries whose score exceeds the risk threshold (e.g., > 0.85) are flagged and returned with human-readable explanations. This enables proactive governance, preventing statistical disclosure before deployment. The implementation demonstrates strong potential for cross-departmental metric sharing in healthcare while maintaining compliance and auditability. The system also promotes role-based access control (RBAC), supports zero-trust data architectures, and aligns with national data modernization goals by ensuring that metric pipelines are explainable, privacy-preserving, and AI-auditable by design. Unlike prior work that relies on runtime data access to flag privacy violations, the proposed framework performs static, explainable detection at the query level, enabling pre-execution protection and audit readiness.
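
To make the static, pre-execution check concrete, the following is a minimal Python sketch of the structural half of the pipeline only, under several assumptions that are not taken from the framework itself. A regular expression stands in for the AST walk, the CodeBERT embedding step is omitted, and a hand-weighted logistic function stands in for the trained XGBoost classifier. The quasi-identifier list, the weights, the minimum cell size of 10, and all function names are hypothetical; only the 0.85 threshold comes from the text.

```python
import math
import re

# Hypothetical quasi-identifier column names; a real deployment would
# take these from a governed data dictionary.
QUASI_IDENTIFIERS = {"zip", "zip_code", "gender", "dob", "birth_date", "age", "race"}

# Hypothetical minimum group size treated as an adequate small-cell guard.
MIN_CELL_SIZE = 10

# Hand-set illustrative weights. They stand in for the trained XGBoost
# classifier and are NOT learned from data.
WEIGHTS = {"n_group_cols": 0.4, "n_sensitive_group_cols": 1.5, "has_min_count_guard": -3.0}
BIAS = -1.0

RISK_THRESHOLD = 0.85  # example threshold mentioned in the abstract

GROUP_BY_RE = re.compile(
    r"\bgroup\s+by\s+(.+?)(?:\bhaving\b|\border\s+by\b|\blimit\b|;|$)",
    re.IGNORECASE | re.DOTALL,
)
GUARD_RE = re.compile(r"\bhaving\s+count\s*\(\s*\*\s*\)\s*>=?\s*(\d+)", re.IGNORECASE)


def group_by_columns(sql):
    """Return lower-cased GROUP BY column names (regex stand-in for an AST walk)."""
    match = GROUP_BY_RE.search(sql)
    if not match:
        return []
    parts = (part.strip() for part in match.group(1).split(","))
    # Drop any table qualifier such as "e." in "e.zip_code".
    return [part.split(".")[-1].lower() for part in parts if part]


def extract_features(sql):
    """Return (numeric structural features, sensitive GROUP BY columns)."""
    cols = group_by_columns(sql)
    sensitive = [c for c in cols if c in QUASI_IDENTIFIERS]
    guard = GUARD_RE.search(sql)
    has_guard = guard is not None and int(guard.group(1)) >= MIN_CELL_SIZE
    features = {
        "n_group_cols": len(cols),
        "n_sensitive_group_cols": len(sensitive),
        "has_min_count_guard": int(has_guard),
    }
    return features, sensitive


def risk_score(features):
    """Map features to a score in (0, 1) with a logistic function."""
    z = BIAS + sum(WEIGHTS[name] * features[name] for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))


def review(sql):
    """Score a metric definition and attach human-readable reasons."""
    features, sensitive = extract_features(sql)
    score = risk_score(features)
    reasons = []
    if sensitive:
        reasons.append("groups by quasi-identifier(s): " + ", ".join(sensitive))
    if not features["has_min_count_guard"]:
        reasons.append(f"no HAVING COUNT(*) guard of at least {MIN_CELL_SIZE}")
    return {"score": score, "flagged": score > RISK_THRESHOLD, "reasons": reasons}


if __name__ == "__main__":
    risky = "SELECT zip_code, gender, COUNT(*) FROM encounters GROUP BY zip_code, gender"
    safer = ("SELECT department, COUNT(*) AS n FROM encounters "
             "GROUP BY department HAVING COUNT(*) >= 11")
    for query in (risky, safer):
        print(review(query))
```

In this sketch the query is never executed and no patient data is read, which mirrors the static, query-level design described above. In a fuller version, the feature dictionary would presumably be concatenated with the CodeBERT embedding vector before being passed to the trained classifier.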
