Benchmarking PhD-Level Coding in 3D Geometric Computer Vision

AI-assisted coding has rapidly reshaped software practice and research workflows, yet today's models still struggle to produce correct code for complex 3D geometric vision. If models could reliably write such code, research practice in our community would change substantially. To measure progress toward that goal, we introduce GeoCodeBench, a PhD-level benchmark that evaluates coding ability in 3D vision. Each problem is a fill-in-the-function implementation task curated from representative papers at recent venues: a tool first proposes candidate functions from the official repositories, and careful human screening then selects the core 3D geometric components. For every target function, we generate diverse unit tests that include edge cases, enabling fully automatic, reproducible scoring. We evaluate eight representative open- and closed-source models to reflect the current ecosystem. The best model, GPT-5, attains a pass rate of only 36.6%, revealing a large gap between current capabilities and dependable 3D scientific coding. GeoCodeBench organizes tasks into a two-level hierarchy: General 3D capability (geometric transformations and mechanics/optics formulation) and Research capability (novel algorithm implementation and geometric logic routing). Scores are positively correlated across these two axes, but research-oriented tasks are markedly harder. Context ablations further show that "more paper text" is not always better: truncating the input at the Method section statistically outperforms full-paper inputs, highlighting unresolved challenges in long-context scientific comprehension. Together, these findings position GeoCodeBench as a rigorous testbed for advancing from generic coding to trustworthy 3D geometric vision coding.
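To make the fill-in-the-function setup concrete, the following is a minimal sketch of how one task and its automatic scoring might look. It is not taken from GeoCodeBench: the target function `rotate_about_z`, the test cases in `TEST_CASES`, and the helper `pass_rate` are hypothetical, and the sketch assumes that the pass rate is the fraction of unit tests a candidate implementation passes.

```python
import math


def rotate_about_z(points, angle_rad):
    """Hypothetical target function: rotate 3D points about the z-axis.

    In a fill-in-the-function task, the model would be asked to write this body.
    """
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in points]


def close(p, q, tol=1e-9):
    """Compare two 3D points component-wise within an absolute tolerance."""
    return all(abs(a - b) <= tol for a, b in zip(p, q))


# Hypothetical unit tests: (input points, angle, expected points).
TEST_CASES = [
    ([(1.0, 0.0, 0.0)], math.pi / 2, [(0.0, 1.0, 0.0)]),  # quarter turn
    ([(0.0, 0.0, 5.0)], 1.234, [(0.0, 0.0, 5.0)]),        # point on the axis
    ([], 0.5, []),                                        # empty input
    ([(1.0, 2.0, 3.0)], 0.0, [(1.0, 2.0, 3.0)]),          # zero angle
]


def pass_rate(candidate, cases):
    """Fraction of test cases the candidate passes (assumed scoring rule)."""
    passed = 0
    for points, angle, expected in cases:
        try:
            got = candidate(points, angle)
            ok = len(got) == len(expected) and all(
                close(p, q) for p, q in zip(got, expected)
            )
        except Exception:
            # A crashing candidate simply fails that test case.
            ok = False
        passed += ok
    return passed / len(cases)


if __name__ == "__main__":
    print(pass_rate(rotate_about_z, TEST_CASES))
    # A deliberately wrong candidate that ignores the angle.
    print(pass_rate(lambda pts, angle: list(pts), TEST_CASES))
```

Because edge cases such as empty input or points on the rotation axis are easy to satisfy by accident, a wrong candidate can still pass several tests; under the assumed scoring rule it would receive partial credit rather than zero.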
