A robust synthetic data generation framework for machine learning in High-Resolution Transmission Electron Microscopy (HRTEM)

Machine learning techniques are attractive options for developing highly accurate automated analysis tools for nanomaterials characterization, including high-resolution transmission electron microscopy (HRTEM). However, successfully implementing such tools can be difficult because sufficiently large, high-quality training datasets are hard to procure from experiments. In this work, we introduce Construction Zone, a Python package for rapidly generating complex nanoscale atomic structures, and develop an end-to-end workflow for creating large simulated databases for training neural networks. Construction Zone enables fast, systematic sampling of realistic nanomaterial structures and can serve as a random structure generator for simulated databases, which is important for generating large, diverse synthetic datasets. Using HRTEM imaging as an example, we train a series of neural networks on various subsets of our simulated databases to segment nanoparticles, and we holistically study the data curation process to understand how various aspects of the curated simulated data (including simulation fidelity, the distribution of atomic structures, and the distribution of imaging conditions) affect model performance across several experimental benchmarks. Using our results, we achieve state-of-the-art segmentation performance on experimental HRTEM images of nanoparticles from several benchmarks, and we discuss robust strategies for consistently achieving high performance with machine learning in experimental settings using purely synthetic data.
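To make the idea of a random structure generator concrete, the following is a minimal sketch written with plain NumPy. It does not use or reproduce the Construction Zone API; the function names (`fcc_lattice`, `random_rotation`, `random_nanoparticle`), the gold-like lattice constant of about 4.08 Å, and the radius range are all illustrative assumptions. Each call draws a random radius and orientation and carves a spherical particle out of an FCC lattice, which is one simple way to sample a distribution of atomic structures.

```python
import numpy as np

def fcc_lattice(a, n_cells):
    """FCC atom positions (angstroms) for n_cells^3 unit cells, centered at the origin."""
    basis = np.array([[0.0, 0.0, 0.0],
                      [0.5, 0.5, 0.0],
                      [0.5, 0.0, 0.5],
                      [0.0, 0.5, 0.5]])
    idx = np.arange(n_cells)
    cells = np.stack(np.meshgrid(idx, idx, idx, indexing="ij"), axis=-1).reshape(-1, 3)
    pos = (cells[:, None, :] + basis[None, :, :]).reshape(-1, 3) * a
    return pos - pos.mean(axis=0)

def random_rotation(rng):
    """Random proper rotation matrix from the QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))      # normalize column signs
    if np.linalg.det(q) < 0:         # enforce det = +1 (no reflection)
        q[:, 0] = -q[:, 0]
    return q

def random_nanoparticle(rng, a=4.08, r_min=10.0, r_max=20.0):
    """Sample one spherical nanoparticle (hypothetical helper, assumed parameters).

    a            : lattice constant in angstroms (assumed, gold-like)
    r_min, r_max : bounds of the uniform radius distribution in angstroms (assumed)
    """
    radius = rng.uniform(r_min, r_max)
    n_cells = int(np.ceil(2 * radius / a)) + 2   # cube large enough to contain the sphere
    pos = fcc_lattice(a, n_cells) @ random_rotation(rng).T
    keep = np.linalg.norm(pos, axis=1) <= radius
    return pos[keep], radius

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for _ in range(3):
        atoms, radius = random_nanoparticle(rng)
        print(f"radius = {radius:.1f} A, atoms = {len(atoms)}")
```

In a full workflow, each sampled structure would then be passed to an image simulation step under sampled imaging conditions, and the projected particle footprint would serve as the segmentation label; those stages are outside this sketch.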
