Where Code Meets Natural Language: Taxonomy-Driven Information Flow Analysis for LLM-Integrated Applications

LLM API calls are becoming a ubiquitous program construct, yet they create a boundary that no existing program analysis can cross: runtime values enter a natural-language prompt, undergo opaque processing inside the LLM, and re-emerge as code, SQL, JSON, or text that the program consumes. Every analysis that tracks data across function boundaries, including taint analysis, program slicing, dependency analysis, and change-impact analysis, relies on dataflow summaries of callee behavior. LLM calls have no such summaries, so all of these analyses break at what we call the NL/PL boundary. We present the first information flow method to bridge this boundary. Grounded in quantitative information flow theory, our taxonomy defines 24 labels along two orthogonal dimensions: information preservation level (from lexically preserved to fully blocked) and output modality (natural language, structured format, executable artifact). We label 9,083 placeholder-output pairs from 4,154 real-world Python files and validate the taxonomy's reliability with Cohen's $\kappa = 0.82$ and near-complete coverage (0.01\% unclassifiable). We demonstrate the taxonomy's utility on two downstream applications. (1)~A two-stage taint propagation pipeline that combines taxonomy-based filtering with LLM verification achieves $F_1 = 0.923$ on 353 expert-annotated pairs, and cross-language validation on six real-world OpenClaw prompt injection cases further supports its effectiveness. (2)~Taxonomy-informed backward slicing reduces slice size by a mean of 15\% in files containing non-propagating placeholders. Per-label analysis reveals that four blocked labels account for nearly all non-propagating cases, providing actionable filtering criteria for tool builders.
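As an illustration of the two-stage idea, the Python sketch below first drops placeholder-output pairs whose taxonomy label marks the flow as blocked, then passes the remaining pairs to a verifier. It is a minimal sketch under stated assumptions, not the paper's implementation: the names `Preservation`, `Modality`, `Pair`, and `propagate_taint` are hypothetical, only the two endpoint preservation levels named in the abstract are modeled, and the verifier is a plain callable standing in for an LLM call.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Iterable, List


class Preservation(Enum):
    # Hypothetical names: the abstract only states the two endpoints of this
    # dimension; the intermediate levels are omitted here.
    LEXICALLY_PRESERVED = "lexically_preserved"
    FULLY_BLOCKED = "fully_blocked"


class Modality(Enum):
    # The three output modalities listed in the abstract.
    NATURAL_LANGUAGE = "natural_language"
    STRUCTURED_FORMAT = "structured_format"
    EXECUTABLE_ARTIFACT = "executable_artifact"


@dataclass(frozen=True)
class Pair:
    """One placeholder-output pair with its (assumed) taxonomy label."""
    placeholder: str
    preservation: Preservation
    modality: Modality


def propagate_taint(
    pairs: Iterable[Pair],
    verify: Callable[[Pair], bool],
) -> List[Pair]:
    """Return the pairs through which taint is judged to propagate.

    Stage 1 (taxonomy filter): discard pairs labeled as fully blocked.
    Stage 2 (verification): keep only pairs the verifier confirms.
    """
    candidates = [p for p in pairs if p.preservation is not Preservation.FULLY_BLOCKED]
    return [p for p in candidates if verify(p)]


# Hypothetical example data and a stub verifier in place of an LLM call.
example_pairs = [
    Pair("user_query", Preservation.LEXICALLY_PRESERVED, Modality.EXECUTABLE_ARTIFACT),
    Pair("style_hint", Preservation.FULLY_BLOCKED, Modality.NATURAL_LANGUAGE),
]
tainted = propagate_taint(example_pairs, verify=lambda p: True)
```

With the stub verifier accepting everything, only the filter decides the outcome, so `tainted` contains the `user_query` pair and omits `style_hint`. Running the cheap label-based filter first means the more expensive verifier is invoked only on pairs that could still carry taint.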
