We introduce TeMLM, a set of transparency-first release artifacts for clinical language models. TeMLM unifies provenance, data transparency, modeling transparency, and governance into a single, machine-checkable release bundle. We define an artifact suite (TeMLM-Card, TeMLM-Datasheet, TeMLM-Provenance) and a lightweight conformance checklist for repeatable auditing. We instantiate the artifacts on Technetium-I, a large-scale synthetic clinical NLP dataset with 498,000 notes, 7.74M PHI entity annotations across 10 types, and ICD-9-CM diagnosis labels, and report reference results for ProtactiniumBERT (about 100 million parameters) on PHI de-identification (token classification) and top-50 ICD-9 code extraction (multi-label classification). We emphasize that synthetic benchmarks are valuable for tooling and process validation, but models should be validated on real clinical data prior to deployment.