Machine learning techniques are attractive options for developing highly accurate automated analysis tools for nanomaterials characterization, including high-resolution transmission electron microscopy (HRTEM). However, implementing such tools successfully can be difficult because sufficiently large, high-quality training datasets are hard to procure from experiments. In this work, we introduce Construction Zone, a Python package for rapidly generating complex nanoscale atomic structures, and develop an end-to-end workflow for creating large simulated databases for training neural networks. Construction Zone enables fast, systematic sampling of realistic nanomaterial structures and can serve as a random structure generator for simulated databases, which is important for generating large, diverse synthetic datasets. Using HRTEM imaging as an example, we train a series of neural networks on various subsets of our simulated databases to segment nanoparticles, and we holistically study the data curation process to understand how aspects of the curated simulated data (including simulation fidelity, the distribution of atomic structures, and the distribution of imaging conditions) affect model performance across several experimental benchmarks. Using our results, we achieve state-of-the-art segmentation performance on experimental HRTEM images of nanoparticles from several experimental benchmarks, and we discuss robust strategies for consistently achieving high performance with machine learning in experimental settings using purely synthetic data.