Automating API Documentation from Crowdsourced Knowledge


Bonan Kou, Zijie Zhou, Muhao Chen, Tianyi Zhang

From arXiv; 13 pages, 2 figures; accepted to ICSE 2026

API documentation is crucial for developers to learn and use APIs. However, many official API documents are known to be obsolete and incomplete. To address this challenge, we propose AutoDoc, a new approach that generates API documents from API knowledge extracted from online discussions on Stack Overflow (SO). AutoDoc uses a fine-tuned dense retrieval model to identify seven types of API knowledge in SO posts, then uses GPT-4o to summarize the API knowledge in these posts into concise text. We also designed two dedicated components to handle LLM hallucination and redundancy in the generated content. We evaluated AutoDoc against five baselines on 48 APIs of different popularity levels. Our results indicate that the API documents generated by AutoDoc are up to 77.7% more accurate and 9.5% less duplicated, and contain 34.4% knowledge not covered by the official documents. We also measured the sensitivity of AutoDoc to the choice of LLM. We found that while larger LLMs produce higher-quality API documents, AutoDoc enables smaller open-source models (e.g., Mistral-7B-v0.3) to achieve comparable results. Finally, we conducted a user study to evaluate the usefulness of the API documents generated by AutoDoc. All participants found the API documents generated by AutoDoc to be more comprehensive, concise, and helpful than those produced by the baselines. This highlights the feasibility of using LLMs for API documentation, provided the design carefully counters LLM hallucination and information redundancy.
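The abstract does not say how the retrieval and redundancy components are implemented, so the following is only a toy illustration of the general retrieve-then-deduplicate idea, not AutoDoc's method. It swaps the fine-tuned dense retriever for plain bag-of-words cosine similarity and leaves out the LLM summarization and hallucination handling entirely. The function names, the 0.8 threshold, and the example posts are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, posts, top_k=2):
    """Rank posts by similarity to the query (stand-in for a dense retriever)."""
    q = Counter(tokenize(query))
    ranked = sorted(posts, key=lambda p: cosine(q, Counter(tokenize(p))), reverse=True)
    return ranked[:top_k]


def drop_redundant(sentences, threshold=0.8):
    """Keep a sentence only if it is not too similar to one already kept."""
    kept = []
    for s in sentences:
        vec = Counter(tokenize(s))
        if all(cosine(vec, Counter(tokenize(k))) < threshold for k in kept):
            kept.append(s)
    return kept


if __name__ == "__main__":
    # Made-up posts, for illustration only.
    posts = [
        "HashMap.get returns null when the key is absent, so check containsKey first.",
        "I like the new IDE theme.",
        "HashMap.get returns null when the key is absent so check containsKey first!",
        "Iterating a HashMap while removing entries throws ConcurrentModificationException.",
    ]
    hits = retrieve("HashMap get null key", posts, top_k=3)
    for line in drop_redundant(hits):
        print(line)
```

In this sketch the near-duplicate third post is retrieved but then filtered out, because its token vector matches the first post's. A real system would presumably compare learned embeddings rather than word counts.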

