Deterministic Integrity Gates for LLM-Assisted Clinical Manuscript Preparation: An Auditable Biomedical Informatics Architecture

From arXiv: 28 pages, 3 figures, 4 tables; includes supplementary material (deterministic-detector inventory, per-class defect breakdown, worked example). Software (MIT): https://github.com/Aperivue/medsci-skills . Archived on Zenodo: concept DOI https://doi.org/10.5281/zenodo.20155321 and version DOI (v3.8.0) https://doi.org/10.5281/zenodo.20582972

As autonomous research agents and AI co-scientist systems push large language models (LLMs) from drafting toward end-to-end manuscript production, the bottleneck shifts from generation to verification. Fluent LLM output can hide fabricated citations, numbers that drift from source tables, and unmet reporting-guideline items; existing tools generate without verifying, and self-critique inherits the blind spots that produce confident fabrication. We describe an architecture that pairs generation with verification and rests on three principles: decompose the workflow into self-contained skills; gate every stage transition with halt-on-failure; and resolve each integrity question with the cheapest sufficient mechanism, that is, a deterministic, re-executable check where one suffices and a prose-level probe only where interpretation is unavoidable. This determinism-where-possible split, organized as an integrity-gate taxonomy, is the core contribution. It is realized as MedSci Skills, an open-source toolkit of 43 skills with a 21-detector deterministic tier, evaluated on three public-dataset pipelines (STARD, PRISMA, STROBE) and a seeded-defect ablation. Across the three pipelines, every content-hash manifest verified clean and the gates surfaced real defects. On 27 identical injected defects, the deterministic gates detected all 27 with no false positives on the matched clean fixtures, whereas a single-prompt LLM reviewer detected 11; its misses fell among code, bibliography, and style defects that the prose hides. Determinism-where-possible verification yields an auditable, re-executable trail that exposes the evidence a human needs to check an LLM-assisted manuscript. These results are feasibility and reproducibility evidence, not a claim of human-competitive quality, which a separate blinded study addresses. MedSci Skills is MIT-licensed and archived (v3.8.0).
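
The two mechanisms named in the abstract, halt-on-failure gating and content-hash manifest verification, can be illustrated with a minimal Python sketch. This is not the MedSci Skills implementation: every name here (`GateResult`, `GateFailure`, `build_manifest`, `manifest_gate`, `citation_gate`, `run_gates`) is hypothetical, and the `[@key]` citation syntax and SHA-256 digest are assumptions chosen only for illustration.

```python
from __future__ import annotations

import hashlib
import re
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class GateResult:
    """Outcome of one deterministic gate (hypothetical structure)."""
    gate: str
    passed: bool
    findings: list[str] = field(default_factory=list)


class GateFailure(Exception):
    """Raised to halt the pipeline at the first failing gate."""

    def __init__(self, result: GateResult) -> None:
        super().__init__(f"gate '{result.gate}' failed: {result.findings}")
        self.result = result


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def build_manifest(artifacts: dict[str, bytes]) -> dict[str, str]:
    """Record one content hash per artifact so a later run can re-verify it."""
    return {name: sha256_hex(data) for name, data in artifacts.items()}


def manifest_gate(artifacts: dict[str, bytes], manifest: dict[str, str]) -> GateResult:
    """Deterministic check: every manifest entry must exist and hash identically."""
    findings = []
    for name, expected in sorted(manifest.items()):
        data = artifacts.get(name)
        if data is None:
            findings.append(f"{name}: missing")
        elif sha256_hex(data) != expected:
            findings.append(f"{name}: hash mismatch")
    return GateResult("manifest", not findings, findings)


def citation_gate(manuscript: str, bib_keys: set[str]) -> GateResult:
    """Deterministic check: every cited key must exist in the bibliography.

    Assumes citations are written as [@key]; this syntax is an assumption.
    """
    cited = set(re.findall(r"\[@([\w:-]+)\]", manuscript))
    findings = [f"unresolved citation: {key}" for key in sorted(cited - bib_keys)]
    return GateResult("citations", not findings, findings)


def run_gates(gates: list[Callable[[], GateResult]]) -> list[GateResult]:
    """Run gates in order; stop at the first failure (halt-on-failure)."""
    passed = []
    for gate in gates:
        result = gate()
        if not result.passed:
            raise GateFailure(result)
        passed.append(result)
    return passed


if __name__ == "__main__":
    artifacts = {
        "table1.csv": b"tp,fp\n91,9\n",
        "draft.md": b"Sensitivity was 0.91 [@smith2020] and [@ghost2021].",
    }
    manifest = build_manifest(artifacts)
    try:
        run_gates([
            lambda: manifest_gate(artifacts, manifest),
            lambda: citation_gate(artifacts["draft.md"].decode(), {"smith2020"}),
        ])
    except GateFailure as exc:
        print(exc)  # the pipeline halts here instead of proceeding to the next stage
```

Both gates are pure functions of their inputs, so rerunning them on the same artifacts gives the same verdict. That is the property that makes the resulting trail re-executable, and it is why such checks suit defects in code or bibliography files that a reviewer reading only the prose would not see.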
