Big Data Approaches to Bovine Bioacoustics: A FAIR-Compliant Dataset and Scalable ML Framework for Precision Livestock Welfare

The convergence of IoT sensing, edge computing, and machine learning is transforming precision livestock farming. Yet bioacoustic data streams remain underused because of computational complexity and challenges of ecological validity. We present one of the most comprehensive bovine vocalization datasets to date: 569 curated clips covering 48 behavioral classes, recorded across three commercial dairy farms using multiple microphone arrays and expanded to 2900 samples through domain-informed augmentation. This FAIR-compliant resource addresses four major Big Data challenges: volume (90 hours of recordings, 65.6 GB), variety (multi-farm and multi-zone acoustics), velocity (real-time processing), and veracity (noise-robust feature extraction). Our distributed processing framework integrates advanced denoising with iZotope RX, multimodal synchronization through audio-video alignment, and standardized feature engineering with 24 acoustic descriptors generated from Praat, librosa, and openSMILE. Preliminary benchmarks reveal distinct class-level acoustic patterns for estrus detection, distress classification, and maternal communication. The dataset's ecological realism, which reflects authentic barn acoustics rather than controlled settings, makes it well suited to field deployment. This work establishes a foundation for animal-centered AI, in which bioacoustic data enable continuous, non-invasive welfare assessment at industrial scale. By releasing standardized pipelines and detailed metadata, we promote reproducible research that connects Big Data analytics, sustainable agriculture, and precision livestock management. The framework supports UN SDG 9, showing how data science can turn traditional farming into intelligent, welfare-optimized systems that meet global food needs while upholding ethical animal care.

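The abstract mentions augmentation and noise-robust acoustic descriptors but does not say which techniques were used. As a rough illustration only, the sketch below shows one common augmentation (mixing background noise into a clip at a chosen signal-to-noise ratio) and one simple descriptor (zero-crossing rate). The function names, the synthetic 200 Hz tone, the 16 kHz sample rate, and the 10 dB SNR are all assumptions made for this example. None of them is taken from the authors' pipeline, which relies on Praat, librosa, and openSMILE.

```python
import math
import random


def rms(signal):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))


def mix_at_snr(clean, noise, snr_db):
    """Add noise to a clean clip, scaled so the result has the requested SNR in dB.

    Hypothetical helper: additive-noise mixing is one generic augmentation,
    not necessarily the one used to build the dataset.
    """
    target_noise_rms = rms(clean) / (10 ** (snr_db / 20))
    gain = target_noise_rms / rms(noise)
    return [c + gain * n for c, n in zip(clean, noise)]


def zero_crossing_rate(signal):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(signal, signal[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(signal) - 1)


def synthetic_tone(freq_hz, sample_rate, duration_s):
    """Pure sine tone standing in for a recorded vocalization (assumption)."""
    n = int(sample_rate * duration_s)
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    clean = synthetic_tone(200.0, 16000, 0.5)
    noise = [rng.gauss(0.0, 1.0) for _ in clean]
    augmented = mix_at_snr(clean, noise, snr_db=10.0)
    print("ZCR clean:    ", round(zero_crossing_rate(clean), 4))
    print("ZCR augmented:", round(zero_crossing_rate(augmented), 4))
```

Comparing a descriptor before and after augmentation in this way is one simple check on how sensitive a feature is to barn-like background noise. A real pipeline would replace the synthetic tone and Gaussian noise with recorded clips and recorded ambient sound.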