Text documents, including programs, typically have human-readable semantic structure. Historically, programmatic access to these semantics has required explicit in-document tagging. Especially in systems where the text has an execution semantics, this makes tagging an opt-in feature that is hard to support properly. Today, language models offer a new method: metadata can be bound to entities in changing text using a model's human-like understanding of semantics, with no requirements on the document structure. This method expands the applications of document annotation, a fundamental operation in program writing, debugging, maintenance, and presentation. We contribute a system that employs an intelligent agent to re-tag modified programs, enabling rich annotations to automatically follow code as it evolves. We also contribute a formal problem definition, an empirical synthetic benchmark suite, and our benchmark generator. Our system achieves an accuracy of 90% on our benchmarks and can replace a document's tags in parallel at a rate of 5 seconds per tag. While there remains significant room for improvement, we find performance reliable enough to justify further exploration of applications.