Software languages evolve over time for reasons such as feature additions. When grammars evolve, textual instances that originally conformed to them may become outdated. While model-driven engineering provides many techniques for co-evolving models with metamodel changes, these approaches are not designed for textual DSLs and may lose human-relevant information such as layout and comments. This study systematically evaluates the potential of large language models (LLMs) for co-evolving grammars and instances of textual DSLs. Using Claude Sonnet 4.5 and GPT-5.2 across ten case languages with ten runs each, we assess both correctness and preservation of human-oriented information. Results show strong performance on small-scale cases ($\geq$94\% precision and recall for instances requiring fewer than 20 modified lines), but performance degrades with scale: Claude maintains 85\% recall at 40 modified lines, while GPT fails on the largest instances. Response time increases substantially with instance size, and grammar evolution complexity and deletion granularity affect performance more than change type does. These findings clarify when LLM-based co-evolution is effective and where current limitations remain.