Large language models can write scientific code, but direct paper-to-program translation remains fragile when correctness depends on tacit conventions rather than explicit equations. We frame this as a \textbf{knowledge-externalization} problem: index choices, gauges, fermionic signs, contraction order, validation gates, and scaling constraints must be made explicit before code generation. We evaluate a multi-stage, human-in-the-loop workflow on two quantum many-body tasks. DMRG from Schollw\"ock's pedagogical review serves as calibration: specification-guided implementations pass in all 16 model pairings, compared with 6/13 direct attempts, and a prose-specification ablation shows that externalized content, not \LaTeX{} form, is the active ingredient. Pfaffian conversion of HFB states to MPS from the five-page Letter by Jin et al. serves as the stress test: no public implementation is available, and success depends on tacit sign, gauge, ordering, and scalability conventions. Here the workflow yields 11/26 audited passes, while direct prompting yields none. Cross-specification transfer is asymmetric: non-GPT specifications implemented by GPT~5.5 pass 4/4, whereas GPT~5.5 specifications implemented by weaker models fail 4/4. The contrast supports a two-bottleneck picture. Externalization resolves the first bottleneck -- paper-to-code ambiguity -- well enough to make DMRG reproducible and Pfaffian-MPS auditable. The remaining failures expose a second bottleneck in implementation-model capability. Iterative meta-specification moves this boundary but does not eliminate it. The resulting \emph{Paper-to-Program Many-Body} skill is both a reusable implementation protocol and a diagnostic instrument for AI-assisted many-body programming.