weatherlinguist

Data engineering for meteorology

This is a bit of a rehash of an old post in my other blog, but since I will be using bearblog for the time being I will move all those posts here. The vast majority of my time these days is spent fiddling with data transfers from one machine to another and putting that data to use.

Following this guide, I started reviewing the current data stack used by data engineers. I consider most of my work to fall broadly within the field of data engineering (DE), but without the common data stack found in most DE enterprises. There are, however, some important differences when you work with the massive amounts of gridded data found in meteorology. One is that in meteorology the data volumes are vastly larger than in any usual so-called "data warehouse" system. The tools used to extract data from weather models are also highly specialized. One example that most users familiar with data from the ECMWF weather model will recognize is the MARS database (Meteorological Archival and Retrieval System).
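To give a flavour of how specialized this tooling is, here is a sketch of what a request in the MARS request language looks like; the parameter values (date, steps, grid resolution, target file) are purely illustrative:

```
retrieve,
  class  = od,
  stream = oper,
  type   = fc,
  param  = 2t,
  date   = 2024-01-01,
  time   = 00,
  step   = 0/to/24/by/6,
  grid   = 0.25/0.25,
  target = "t2m_forecast.grib"
```

Nothing like SQL: you describe the slice of the archive you want (variable, forecast steps, grid) and MARS extracts it into a GRIB file for you.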

Technical differences aside, I wanted to get familiar with the usual data engineering stack and translate the usual data engineering jargon into something I am more familiar with. One of the acronyms that is always thrown around in data engineering discussions is ETL, which stands for Extract, Transform, Load: the three main steps in a data pipeline. That is, extracting data from various sources, transforming it into a suitable format, and then loading it into a target database or data warehouse for analysis.
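As a minimal, toy sketch of those three steps (the station data, column names and SQLite target are all made up for illustration):

```python
import csv
import io
import sqlite3

# Hypothetical raw export: station temperature readings in Kelvin.
RAW_CSV = """station,timestamp,temperature_k
oslo,2024-01-01T00:00,271.15
oslo,2024-01-01T01:00,270.65
"""

def extract(text):
    """Extract: parse rows from a CSV source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: convert Kelvin to Celsius, keep a tidy schema."""
    return [
        (r["station"], r["timestamp"], round(float(r["temperature_k"]) - 273.15, 2))
        for r in rows
    ]

def load(records, conn):
    """Load: write the cleaned records into a target database."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS obs (station TEXT, timestamp TEXT, t2m_c REAL)"
    )
    conn.executemany("INSERT INTO obs VALUES (?, ?, ?)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
print(conn.execute("SELECT * FROM obs").fetchall())
# → [('oslo', '2024-01-01T00:00', -2.0), ('oslo', '2024-01-01T01:00', -2.5)]
```

In a real meteorological pipeline the "source" would of course be something like MARS rather than a CSV string, but the shape of the pipeline is the same.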

The most common format one finds when people refer to weather data is a time series of some meteorological variable, like temperature, wind speed or pressure, at a specific location. This is usually the output of a weather model, which means the data has already passed through a previous Extraction pipeline that read it from one of the common meteorological formats (GRIB or NetCDF) and then interpolated it to a specific location or converted it to different units (i.e., Transformed it).
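With xarray this "gridded field to point time series" step is a one-liner. The sketch below builds a small synthetic grid instead of opening a real file (the variable name `t2m`, coordinates and station location are illustrative); a real pipeline would start from something like `xr.open_dataset("forecast.nc")`:

```python
import numpy as np
import xarray as xr

# Synthetic "model output": 2 m temperature in Kelvin on a lat/lon grid.
rng = np.random.default_rng(0)
ds = xr.Dataset(
    {"t2m": (("time", "latitude", "longitude"),
             273.15 + rng.random((4, 5, 5)) * 10)},
    coords={
        "time": np.arange(4),
        "latitude": np.linspace(55.0, 65.0, 5),
        "longitude": np.linspace(5.0, 15.0, 5),
    },
)

# "Transform": pick the grid point nearest a station location
# and convert Kelvin to Celsius.
point = ds["t2m"].sel(latitude=59.9, longitude=10.7, method="nearest") - 273.15
print(point.values)  # time series of 2 m temperature in °C at that point
```

Proper bilinear interpolation (`ds.interp(...)`) instead of nearest-neighbour selection is also a one-liner, which is part of why xarray is so popular in this field.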

Storing 4D weather data in GRIB/NetCDF format in a common database like PostgreSQL or MySQL is probably not feasible, but I wanted to know whether there are any standards out there for storing this sort of data in a data engineering context. I found some discussion about using GeoServer (an open source server for sharing geospatial data) for this, and apparently it supports the storage of GRIB files. Another cloud storage format widely used today is the zarr format, [as I mentioned in a previous post](https://weatherlinguist.bearblog.dev/cloud-optimized-weather-data-storage-with-zarr/).

An important topic in data engineering is data streaming. For single-point data, like temperature at specific locations, streaming with services like Kafka is probably a good fit, but for massive data sets more specialized tools have been developed, like the Aviso notification system at ECMWF.
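For the single-point case, a streamed record is just a small serialized message. A sketch of what one observation might look like on the wire (the field names, topic and broker address are all illustrative, not any standard schema):

```python
import json

# One point observation, serialized the way a Kafka producer
# would typically send it.
obs = {
    "station": "oslo",
    "time": "2024-01-01T00:00:00Z",
    "variable": "t2m",
    "value": -2.0,
    "units": "degC",
}
payload = json.dumps(obs).encode("utf-8")

# With the kafka-python client this payload would be published
# roughly like so (requires a running broker):
#   from kafka import KafkaProducer
#   producer = KafkaProducer(bootstrap_servers="localhost:9092")
#   producer.send("observations", payload)

print(json.loads(payload.decode("utf-8"))["value"])  # → -2.0
```

The contrast with the gridded case is clear: a message like this is a few hundred bytes, while a single model output file is gigabytes, which is why systems like Aviso stream notifications about the data rather than the data itself.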

Digging further into this topic and how it is approached in the atmospheric sciences, I found a really good discussion in this paper, which describes the Pangeo Forge project, an open source platform for ETL processes on geospatial data in the cloud.