Integrating multiple (sub-)systems is essential to create advanced Information Systems (ISs). Difficulties mainly arise when integrating dynamic environments across the IS lifecycle. A traditional approach is a registry that provides the API documentation of the systems' endpoints. Large Language Models (LLMs) have been shown to be capable of automatically creating system integrations (e.g., as service composition) based on this documentation, but they require concise input due to input token limitations, especially for comprehensive API descriptions. Currently, it is unknown how best to preprocess these API descriptions. In this work, we (i) analyze the use of Retrieval Augmented Generation (RAG) for endpoint discovery and the chunking, i.e., preprocessing, of OpenAPI specifications to reduce the input token length while preserving the most relevant information. To further reduce the input token length for the composition prompt and to improve endpoint retrieval, we propose (ii) a Discovery Agent that only receives a summary of the most relevant endpoints and retrieves details on demand. We evaluate RAG for endpoint discovery using the RestBench benchmark: first, for the different chunking possibilities and parameters, measuring the endpoint retrieval recall, precision, and F1 score; then, we assess the Discovery Agent on the same test set. With our prototype, we demonstrate how to successfully employ RAG for endpoint discovery to reduce the token count. While our results show high recall, precision, and F1 values, further research is necessary to retrieve all requisite endpoints. Our experiments show that, for preprocessing, LLM-based and format-specific approaches outperform na\"ive chunking methods. Relying on an agent further enhances these results, as the agent splits the task into multiple fine-grained subtasks, improving the overall RAG performance in terms of token count, precision, and F1 score.
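The format-specific chunking and the retrieval metrics mentioned above can be illustrated with a minimal sketch. This is a hypothetical simplification, not the paper's implementation: it splits an OpenAPI specification dictionary into one chunk per endpoint (a common format-aware strategy) and computes recall, precision, and F1 against a ground-truth endpoint set.

```python
# Hypothetical sketch: endpoint-level chunking of an OpenAPI spec and
# retrieval metrics. The spec structure below is a minimal illustrative
# example, not the RestBench data used in the paper.

def chunk_openapi(spec: dict) -> list[dict]:
    """Split an OpenAPI spec into one chunk per (HTTP method, path) pair,
    keeping only the fields most relevant for endpoint discovery."""
    chunks = []
    for path, operations in spec.get("paths", {}).items():
        for method, op in operations.items():
            chunks.append({
                "endpoint": f"{method.upper()} {path}",
                "summary": op.get("summary", ""),
                "description": op.get("description", ""),
            })
    return chunks


def retrieval_metrics(retrieved: list[str], relevant: list[str]) -> tuple[float, float, float]:
    """Precision, recall, and F1 for retrieved vs. ground-truth endpoints."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    true_positives = len(retrieved_set & relevant_set)
    precision = true_positives / len(retrieved_set) if retrieved_set else 0.0
    recall = true_positives / len(relevant_set) if relevant_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    spec = {
        "paths": {
            "/tracks": {"get": {"summary": "List tracks"}},
            "/albums/{id}": {"get": {"summary": "Get an album"}},
        }
    }
    chunks = chunk_openapi(spec)
    print([c["endpoint"] for c in chunks])

    # One spurious endpoint retrieved alongside the two relevant ones.
    p, r, f1 = retrieval_metrics(
        retrieved=["GET /tracks", "GET /albums/{id}", "GET /artists"],
        relevant=["GET /tracks", "GET /albums/{id}"],
    )
    print(round(p, 3), round(r, 3), round(f1, 3))
```

Chunking per endpoint keeps each retrieved unit self-contained, so the composition prompt only carries the endpoints the retriever deems relevant rather than the full specification.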