10 Simple Rules for Improving Your Standardized Fields and Terms


Rhiannon Cameron, Emma Griffiths, Damion Dooley, William Hsiao

From arXiv. 17 pages, 1 figure. Author Contributions: Conceptualization by EG and RC. Manuscript writing by RC. Revisions and editing by RC, EG, DD, and WH. Acknowledgements: Charlotte Barclay. Version 2: Added missing word on page 10.

Contextual metadata is the unsung hero of research data. When done right, standardized and structured vocabularies make your data findable, shareable, and reusable. When done wrong, they turn a well-intentioned effort into a data cleanup and curation nightmare. In this paper, we tackle the surprisingly tricky process of vocabulary standardization with a mix of practical advice and grounded examples. Drawing on real-world experience in contextual data harmonization, we highlight common challenges (e.g., semantic noise and concept bombs) and provide actionable strategies to address them. Our rules emphasize alignment with the Findability, Accessibility, Interoperability, and Reusability (FAIR) principles while remaining adaptable to evolving user and research needs. Whether you are curating datasets, designing a schema, or contributing to a standards body, these rules aim to help you create metadata that is not only technically sound but also meaningful to users.

