Large language models can answer questions about textbooks, lecture notes, and programming exercises more reliably when their answers are grounded in an explicit knowledge source. Retrieval-augmented generation (RAG) is a common approach: relevant fragments of a document are retrieved and inserted into the model's context before it answers. For mathematical and technical material, the original LaTeX source can be a better starting point than a PDF, because it preserves structural information (labels, sectioning commands, macros) and authorial intent that are often lost or distorted in PDF extraction. However, LaTeX source is not automatically AI-friendly. Cross-references must be resolved, custom macros must be interpreted, exercises and examples must be identified, and author-supplied semantic metadata may be needed. This article describes a focused preprocessing approach for turning LaTeX source, together with the auxiliary files produced by compilation and optional author annotations, into Markdown and JSONL chunks suitable for indexing in a vector database.
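To make the idea concrete, the following is a minimal sketch of two of the steps named above: resolving cross-references from a compiled `.aux` file and emitting JSONL chunks. It is not the implementation described in this article. The function names (`parse_aux_labels`, `resolve_refs`, `chunk_by_section`, `to_jsonl`) and the chunk fields (`title`, `text`) are hypothetical, and the sketch assumes that `.aux` entries have the form `\newlabel{key}{{number}{page}...}`, that only `\ref` needs resolving, and that chunks are split at `\section` commands only.

```python
import json
import re

# Assumed .aux entry shape: \newlabel{key}{{number}{page}...}
# (hyperref appends extra fields; only the first one, the number, is read).
NEWLABEL = re.compile(r"\\newlabel\{([^}]+)\}\{\{([^}]*)\}")
REF = re.compile(r"\\ref\{([^}]+)\}")
SECTION = re.compile(r"\\section\{([^}]*)\}")


def parse_aux_labels(aux_text):
    """Map each label key to the number LaTeX assigned to it."""
    return {m.group(1): m.group(2) for m in NEWLABEL.finditer(aux_text)}


def resolve_refs(tex, labels):
    """Replace \\ref{key} with its number, or '??' if the key is unknown."""
    return REF.sub(lambda m: labels.get(m.group(1), "??"), tex)


def chunk_by_section(tex):
    """Split at \\section commands into Markdown-flavoured chunks."""
    # With one capture group, split() yields
    # [preamble, title1, body1, title2, body2, ...].
    parts = SECTION.split(tex)
    chunks = []
    for i in range(1, len(parts), 2):
        title, body = parts[i], parts[i + 1].strip()
        chunks.append({"title": title, "text": f"## {title}\n\n{body}"})
    return chunks


def to_jsonl(chunks):
    """Serialize chunks as one JSON object per line."""
    return "\n".join(json.dumps(c, ensure_ascii=False) for c in chunks)
```

A typical call chain under these assumptions would be `to_jsonl(chunk_by_section(resolve_refs(tex, parse_aux_labels(aux))))`. A real pipeline would also need to handle macros, `\eqref` and similar commands, nested sectioning, and exercise or example environments, none of which this sketch attempts.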