Citizen science databases, built from volunteer-led sampling of species communities, are relied on as essential sources of data in ecology. Summarizing such data with a frequentist-valid prediction set for each county provides an interpretable comparison across counties of varying size or composition. Because citizen science data often feature unequal sampling effort across a spatial domain, prediction sets constructed with indirect methods that share information across counties may be used to improve precision. In this article, we present a nonparametric framework for obtaining precise prediction sets for a multinomial random sample based on indirect information, while maintaining frequentist coverage guarantees for each county. We detail a simple algorithm for obtaining these prediction sets whose computation time does not depend on the sample size and scales well with the number of species considered. The indirect information may be estimated by a proposed empirical Bayes procedure that draws on auxiliary data. Our approach makes inference for under-sampled counties more precise while maintaining area-specific frequentist validity. We apply the method to describe avian species abundance in North Carolina, USA, using citizen science data from the eBird database.
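To make the idea of a prediction set built from indirect information concrete, the following is a minimal sketch of a conformal-style construction for a single county. It is an illustration under stated assumptions and not necessarily the algorithm detailed in the article: the function name `conformal_prediction_set`, the encoding of indirect information as Dirichlet-style pseudo-counts `prior`, and the conformity score (augmented count plus pseudo-count) are all hypothetical choices. Under exchangeability of the observed individuals and the next one, a set built this way covers the next observed species with probability at least 1 - alpha whatever `prior` is; the pseudo-counts only influence which species enter the set and how large it is. The loop works on the species counts alone, so its cost depends on the number of species rather than on the sample size.

```python
from typing import List, Sequence


def conformal_prediction_set(
    counts: Sequence[int],
    prior: Sequence[float],
    alpha: float = 0.1,
) -> List[int]:
    """Return indices of species in a (1 - alpha) conformal prediction set.

    counts : observed number of individuals per species in one county.
    prior  : hypothetical pseudo-counts encoding indirect information
             (for example, estimated from neighbouring counties).
    """
    if len(counts) != len(prior):
        raise ValueError("counts and prior must have the same length")
    n_species = len(counts)
    n_total = sum(counts)
    selected = []
    for k in range(n_species):
        # Augment the sample with a hypothetical new observation of species k.
        aug = list(counts)
        aug[k] += 1
        # Conformity score per species: augmented count plus pseudo-count
        # (an assumed score; any score symmetric in the data keeps validity).
        scores = [aug[j] + prior[j] for j in range(n_species)]
        # Conformal p-value: share of the augmented sample whose score is
        # no larger than the score of the candidate species.
        at_most = sum(aug[j] for j in range(n_species) if scores[j] <= scores[k])
        p_value = at_most / (n_total + 1)
        if p_value > alpha:
            selected.append(k)
    return selected


# Made-up counts for five species and a flat prior, for illustration only.
print(conformal_prediction_set([50, 30, 15, 4, 1], [1, 1, 1, 1, 1], alpha=0.1))
# -> [0, 1, 2]
```

In this toy example the two rarest species are excluded at the 90% level. Replacing the flat pseudo-counts with values estimated from auxiliary data would change the ordering of species, and hence the set, without affecting the coverage argument.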