Veterinary electronic health records (vEHRs) contain privacy-sensitive identifiers that limit secondary use. While PetEVAL provides a benchmark for veterinary de-identification, the domain remains low-resource. This study evaluates whether large language model (LLM)-generated synthetic narratives improve de-identification safety under distinct training regimes, emphasizing (i) synthetic augmentation and (ii) fixed-budget substitution. We conducted a controlled simulation using a PetEVAL-derived corpus (3,750 holdout / 1,249 train). We generated 10,382 synthetic notes using a privacy-preserving "template-only" regime in which identifiers were removed prior to LLM prompting. Three transformer backbones (PetBERT, VetBERT, Bio_ClinicalBERT) were trained under varying real/synthetic mixtures. Evaluation prioritized document-level leakage rate (the fraction of documents with at least one missed identifier) as the primary safety outcome. Results show that under fixed-sample substitution, replacing real notes with synthetic ones monotonically increased leakage, indicating that synthetic data cannot safely replace real supervision. Under compute-matched training, moderate synthetic mixing matched real-only performance, but high synthetic dominance degraded utility. Conversely, epoch-scaled augmentation improved performance: PetBERT span-overlap F1 increased from 0.831 to 0.850 ± 0.014, and leakage decreased from 6.32% to 4.02% ± 0.19%. However, these gains largely reflect increased training exposure rather than intrinsic synthetic data quality. Corpus diagnostics revealed systematic synthetic-real mismatches in note length and label distribution that align with persistent leakage. We conclude that synthetic augmentation is effective for expanding exposure but is complementary, not substitutive, for safety-critical veterinary de-identification.
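The primary safety outcome, document-level leakage rate, can be sketched as follows. This is a minimal illustration, not the PetEVAL evaluation code: the character-offset span representation and the overlap criterion are assumptions for the example.

```python
from typing import List, Tuple

# A span marks one identifier as (start, end) character offsets, end-exclusive.
Span = Tuple[int, int]

def spans_overlap(a: Span, b: Span) -> bool:
    """True if two character spans share at least one position."""
    return a[0] < b[1] and b[0] < a[1]

def document_leaks(gold: List[Span], predicted: List[Span]) -> bool:
    """A document leaks if at least one gold identifier span has no
    overlapping predicted span (i.e., a missed identifier)."""
    return any(
        not any(spans_overlap(g, p) for p in predicted)
        for g in gold
    )

def leakage_rate(docs: List[Tuple[List[Span], List[Span]]]) -> float:
    """Fraction of documents with at least one missed identifier."""
    if not docs:
        return 0.0
    return sum(document_leaks(gold, pred) for gold, pred in docs) / len(docs)
```

For example, a corpus of two documents, one fully redacted and one with a missed identifier, yields a leakage rate of 0.5. Note that this document-level view is deliberately stricter than span-level F1: a single missed span flags the whole document.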