Recent advances in AI coding tools powered by large language models (LLMs) have shown strong capabilities in software engineering tasks, raising expectations of major productivity gains. Tools such as Cursor and Claude Code have popularized "vibe coding" (where developers steer development through high-level intent), commonly relying on context engineering and Retrieval-Augmented Generation (RAG) to ground generation in a codebase. However, these paradigms struggle in ultra-complex enterprise systems, where software evolves incrementally under pervasive design constraints and depends on tacit knowledge such as responsibilities, intent, and decision rationales distributed across code, configurations, discussions, and version history. In this environment, context engineering faces a fundamental barrier: the required context is scattered across artifacts and entangled across time, beyond the capacity of LLMs to reliably capture, prioritize, and fuse evidence into correct and trustworthy decisions, even as context windows grow. To bridge this gap, we propose the Code Digital Twin, a persistent and evolving knowledge infrastructure built on the codebase. It separates long-term knowledge engineering from task-time context engineering and serves as a backend "context engine" for AI coding assistants. The Code Digital Twin models both the physical and conceptual layers of software and co-evolves with the system. By integrating hybrid knowledge representations, multi-stage extraction pipelines, incremental updates, AI-empowered applications, and human-in-the-loop feedback, it transforms fragmented knowledge into explicit and actionable representations, providing a roadmap toward sustainable and resilient development and evolution of ultra-complex systems.