Coding agents increasingly act as codebase-scale collaborators, including on codebase conversion, but this progress has exposed a critical weakness: agents often over-trust their own local validation routines and declare success on artifacts that satisfy surface checks while violating the semantic contracts users actually care about. The problem is especially acute in codebase conversion, where prior evaluation is largely outcome-driven and therefore unstable: two implementations can match on a shallow outcome, such as a single forward loss, while diverging in gradients, optimizer behavior, or short-horizon training dynamics. We introduce T2J-Bench, a benchmark that reformulates codebase conversion as transfer under a fixed equivalence contract. A fixed verifier compares the source and converted codebases through three ordered stages: Spec (interface admissibility), Numeric (forward outputs, losses, gradients, and objective-specific tensors), and Behavioral (short training dynamics under fixed seeds). Across 355 blind conversion attempts, the best system reaches only a 26.7--28.9% overall pass rate despite Spec pass rates of up to 91.1%; a 4.7x spread in token budget yields only a 2.2x spread in pass rate; and all systems overestimate their success by 66.6--97.8 percentage points relative to the fixed evaluator. These results suggest that failures stem more from contract-misaligned self-validation than from limited budget or backbone strength.
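The ordered Spec, Numeric, Behavioral comparison can be illustrated with a minimal sketch. This is not the T2J-Bench verifier: the `Artifact` record, the three check functions, and the single absolute tolerance `tol` are all hypothetical names and simplifications chosen for illustration. The sketch only shows how stage ordering lets a conversion match on a single forward loss and still be rejected.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

# Stage names follow the abstract; everything else here is an assumption.
STAGES = ("Spec", "Numeric", "Behavioral")


@dataclass
class Artifact:
    """Hypothetical summary of what a verifier might record for one codebase."""
    signature: Tuple[str, ...]    # public entry points exposed by the codebase
    loss: float                   # a single forward loss
    grads: Sequence[float]        # flattened gradients for the same batch
    loss_curve: Sequence[float]   # losses over a few fixed-seed training steps


def close(a: Sequence[float], b: Sequence[float], tol: float) -> bool:
    """Element-wise absolute-tolerance comparison of two equal-length sequences."""
    return len(a) == len(b) and all(abs(x - y) <= tol for x, y in zip(a, b))


def spec_ok(src: Artifact, conv: Artifact, tol: float) -> bool:
    # Interface admissibility, reduced here to identical entry points.
    return src.signature == conv.signature


def numeric_ok(src: Artifact, conv: Artifact, tol: float) -> bool:
    # The forward loss AND the gradients must agree, not the loss alone.
    return close([src.loss], [conv.loss], tol) and close(src.grads, conv.grads, tol)


def behavioral_ok(src: Artifact, conv: Artifact, tol: float) -> bool:
    # Short training dynamics under fixed seeds, reduced here to a loss curve.
    return close(src.loss_curve, conv.loss_curve, tol)


def verify(src: Artifact, conv: Artifact, tol: float = 1e-5) -> Optional[str]:
    """Run the stages in order; return the first failing stage, or None if all pass."""
    checks = (spec_ok, numeric_ok, behavioral_ok)
    for name, check in zip(STAGES, checks):
        if not check(src, conv, tol):
            return name
    return None


# Illustrative values only: the converted artifact matches the forward loss
# exactly but has a sign-flipped gradient component.
src = Artifact(("model", "loss_fn"), 0.6931, [0.10, -0.20], [0.6931, 0.6500, 0.6100])
same_loss_bad_grads = Artifact(("model", "loss_fn"), 0.6931, [0.10, 0.20], [0.6931, 0.6500, 0.6100])
print(verify(src, same_loss_bad_grads))  # Numeric
```

A self-check that compared only `loss` would accept `same_loss_bad_grads`, which is the kind of surface-level agreement the abstract describes. Returning the first failing stage, rather than a single pass/fail bit, keeps the result tied to the part of the contract that was violated.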