Authors Should Label Their Own Documents


Marcus Ma, Cole Johnson, Nolan Bridges, Jackson Trager, Georgios Chochlakis, Shrikanth Narayanan

Third-party annotation is the status quo for labeling text, but egocentric information such as sentiment and belief can at best be approximated by a third-person proxy. We introduce author labeling, an annotation technique in which the writer of a document annotates it at the moment of creation. We collaborate with a commercial chatbot that has over 20,000 users to deploy an author labeling annotation system. This system identifies task-relevant queries, generates on-the-fly labeling questions, and records authors' answers in real time. We train and deploy an online-learning model architecture for product recommendation, using author-labeled data to improve performance. The model is trained to minimize the prediction error on questions generated for a set of predetermined subjective beliefs, with author-labeled responses as supervision. Our model achieves a 537% improvement in click-through rate compared to an industry advertising baseline running concurrently. We then compare the quality and practicality of author labeling to three traditional annotation approaches for sentiment analysis and find author labeling to be higher quality, faster to acquire, and cheaper. These findings reinforce existing literature showing that annotations, especially for egocentric and subjective beliefs, are of significantly higher quality when labeled by the author rather than a third party. To facilitate broader scientific adoption, we release an author labeling service for the research community at https://academic.echogroup.ai.
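The abstract does not specify the model architecture, so the following is only a minimal sketch of what "online learning from author-labeled answers" could look like: a logistic model that takes one gradient step each time an author answers a yes/no labeling question. The class `OnlineBeliefModel`, its feature vectors, the learning rate, and the helper `relative_improvement` are all hypothetical names and choices made for illustration; none of them come from the paper. The helper only spells out the arithmetic behind a relative lift: a 537% improvement means the new click-through rate is 6.37 times the baseline.

```python
from __future__ import annotations

import math
from dataclasses import dataclass, field


@dataclass
class OnlineBeliefModel:
    """Hypothetical online logistic model (an assumption, not the paper's model).

    It predicts the probability that an author answers "yes" to a labeling
    question, given a feature vector describing the query.
    """

    n_features: int
    lr: float = 0.1  # assumed learning rate
    weights: list[float] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not self.weights:
            self.weights = [0.0] * self.n_features

    def predict(self, x: list[float]) -> float:
        # Sigmoid of the linear score.
        z = sum(w * xi for w, xi in zip(self.weights, x))
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x: list[float], label: int) -> float:
        """Take one SGD step on the log loss; return the loss before the step."""
        p = self.predict(x)
        loss = -math.log(p if label == 1 else 1.0 - p)
        # Gradient of the log loss with respect to weight i is (p - label) * x_i.
        for i, xi in enumerate(x):
            self.weights[i] -= self.lr * (p - label) * xi
        return loss


def relative_improvement(new_rate: float, baseline_rate: float) -> float:
    """Relative lift of new_rate over baseline_rate, in percent."""
    return (new_rate - baseline_rate) / baseline_rate * 100.0


if __name__ == "__main__":
    model = OnlineBeliefModel(n_features=2)
    # Illustrative stream: the same author keeps answering "yes" for feature 0.
    for _ in range(20):
        model.update([1.0, 0.0], label=1)
    print(round(model.predict([1.0, 0.0]), 3))
    # Illustrative rates only (not figures reported in the paper).
    print(round(relative_improvement(0.0637, 0.01), 1))
```

Updating on each answer as it arrives, rather than retraining in batches, is what makes the sketch "online"; a deployed system would presumably add richer features and safeguards that are not described in the abstract.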

