Structured Context Engineering for File-Native Agentic Systems: Evaluating Schema Accuracy, Format Effectiveness, and Multi-File Navigation at Scale

Large language model (LLM) agents increasingly operate external systems through programmatic interfaces, yet practitioners lack empirical guidance on how to structure the context these agents consume. Using SQL generation as a proxy for programmatic agent operations, we present a systematic study of context engineering for structured data, comprising 9,649 experiments across 11 models, 4 formats (YAML, Markdown, JSON, Token-Oriented Object Notation [TOON]), and schemas ranging from 10 to 10,000 tables. Our findings challenge common assumptions. First, architecture choice is model-dependent: file-based context retrieval improves accuracy for frontier-tier models (Claude, GPT, Gemini; +2.7%, p=0.029) but shows mixed results for open-source models (aggregate -7.7%, p<0.001), with deficits varying substantially by model. Second, format does not significantly affect aggregate accuracy (chi-squared=2.45, p=0.484), though individual models, particularly open-source ones, exhibit format-specific sensitivities. Third, model capability is the dominant factor, with a 21-percentage-point accuracy gap between the frontier and open-source tiers that dwarfs any format or architecture effect. Fourth, file-native agents scale to 10,000 tables through domain-partitioned schemas while maintaining high navigation accuracy. Fifth, file size does not predict runtime efficiency: compact formats can consume significantly more tokens at scale because of search patterns that arise when the format is unfamiliar to the model. These findings give practitioners evidence-based guidance for deploying LLM agents on structured systems and demonstrate that architectural decisions should be tailored to model capability rather than assumed to follow universal best practices.
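As a reading aid for the format result above, the following minimal sketch shows how a chi-squared statistic of 2.45 maps to p=0.484. It assumes (the abstract does not say so explicitly) that the test is a Pearson test of independence over a 4-format by correct/incorrect contingency table, which has 3 degrees of freedom. The function names and the example counts are hypothetical and are not taken from the study.

```python
import math


def chi2_sf_df3(x: float) -> float:
    """Survival function P(X > x) for a chi-squared variable with 3 degrees
    of freedom, using the closed form available for df = 3."""
    return math.erfc(math.sqrt(x / 2.0)) + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)


def pearson_chi2(table):
    """Pearson chi-squared statistic for an r x c contingency table,
    given as a list of rows of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


# Reported statistic from the abstract; df = 3 is an assumption (4 formats x 2 outcomes).
p_value = chi2_sf_df3(2.45)
print(f"p = {p_value:.3f}")

# Hypothetical counts (NOT the study's data): identical accuracy in every
# format yields a statistic of exactly zero, i.e. no evidence of a format effect.
hypothetical = [[80, 20], [80, 20], [80, 20], [80, 20]]
print(pearson_chi2(hypothetical))
```

Under these assumptions the computed p-value agrees with the reported 0.484 to three decimal places, which is consistent with a test over the four formats; it does not by itself confirm how the authors ran the analysis.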
