AI-Friendly LaTeX: Using LaTeX Code as a Knowledge Source for Retrieval-Augmented Generation

Large language models can answer questions about textbooks, lecture notes, and programming exercises more reliably when their answers are grounded in an explicit knowledge source. Retrieval-augmented generation (RAG) is a common approach: relevant fragments of a document are retrieved and inserted into the model's context before it answers. For mathematical and technical material, the original LaTeX source can be a better starting point than a PDF, because it contains structural information, labels, sectioning commands, macros, and authorial intent that are often lost or distorted in PDF extraction. However, LaTeX source is not automatically AI-friendly. Cross-references must be resolved, custom macros must be interpreted, exercises and examples must be identified, and author-supplied semantic metadata may be needed. This article describes a focused preprocessing approach for turning LaTeX source, together with its compiled auxiliary files and optional author annotations, into Markdown and JSONL chunks suitable for indexing in a vector database.
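
To make the preprocessing idea concrete, here is a minimal sketch in Python. It is not the article's implementation. It assumes a document split only by \section commands, an .aux file in the standard \newlabel{key}{{number}{page}} form, and hypothetical helper names (read_labels, resolve_refs, split_sections, to_jsonl). Macro expansion, Markdown conversion, exercise detection, and author annotations are left out.

```python
import json
import re

# Standard LaTeX writes lines such as \newlabel{sec:euler}{{2}{1}} to the
# .aux file: the label key, then the printed number, then the page.
NEWLABEL = re.compile(r"\\newlabel\{([^}]*)\}\{\{([^}]*)\}")
REF = re.compile(r"\\ref\{([^}]*)\}")
SECTION = re.compile(r"\\section\{([^}]*)\}")
LABEL = re.compile(r"\\label\{([^}]*)\}")


def read_labels(aux_text):
    """Map each label key to the number LaTeX printed for it."""
    return dict(NEWLABEL.findall(aux_text))


def resolve_refs(tex, labels):
    """Replace \\ref{key} with its number, or '??' if the key is unknown."""
    return REF.sub(lambda m: labels.get(m.group(1), "??"), tex)


def split_sections(tex):
    """Split on \\section{...} and return (title, body) pairs."""
    # With one capture group, re.split yields
    # [preamble, title1, body1, title2, body2, ...].
    parts = SECTION.split(tex)
    return list(zip(parts[1::2], parts[2::2]))


def to_jsonl(tex, aux_text):
    """Emit one JSON object per section, one per line."""
    labels = read_labels(aux_text)
    lines = []
    for title, body in split_sections(tex):
        chunk = {
            "title": title,
            "labels": LABEL.findall(body),  # keys defined in this section
            "text": resolve_refs(body, labels).strip(),
        }
        lines.append(json.dumps(chunk, ensure_ascii=False))
    return "\n".join(lines)


if __name__ == "__main__":
    sample_tex = (
        "\\section{Intro}\\label{sec:intro}\n"
        "See Section \\ref{sec:euler}.\n"
        "\\section{Euler}\\label{sec:euler}\n"
        "Details follow.\n"
    )
    sample_aux = (
        "\\newlabel{sec:intro}{{1}{1}}\n"
        "\\newlabel{sec:euler}{{2}{1}}\n"
    )
    print(to_jsonl(sample_tex, sample_aux))
```

Each output line is a self-contained record that could be embedded and indexed. Keeping the label keys next to the resolved text is one possible design choice: it lets a retriever follow a cross-reference from one chunk to the chunk that defines it.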

