weatherlinguist

Cloud-optimized weather data storage with zarr

I became familiar with the zarr data format late last year, thanks to a colleague who had been using it while working in academia. The beautiful thing about storing data in zarr, either locally or in the cloud, is that loading the usually large datasets found in weather prediction takes a few seconds at most. In zarr, N-dimensional arrays are stored as compressed chunks. This is especially useful for geospatial raster files in NetCDF or GRIB format. You can think of zarr as a large NetCDF file spread over many smaller files; it follows a similar hierarchical structure to the NetCDF format.

In a nutshell, what zarr does is split the data into multiple chunks and index them in such a way that, when the data is loaded, only the metadata and the specific chunks you ask for are actually read.
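To make this concrete, here is a minimal sketch using xarray and a toy dataset (the store name toy_t2m.zarr, the variable name t2m, and the grid are made up for illustration). The data is written with one chunk per time step, and reading a small slice back only touches the chunks that overlap the selection.

import numpy as np
import xarray

# toy dataset: a single 2D field over 24 hourly time steps
ds = xarray.Dataset(
    {"t2m": (("time", "latitude", "longitude"),
             np.random.rand(24, 181, 360).astype("float32"))},
    coords={
        "time": np.arange(24),
        "latitude": np.linspace(90, -90, 181),
        "longitude": np.linspace(0, 359, 360),
    },
)

# write to zarr with one chunk per time step
ds.to_zarr("toy_t2m.zarr", mode="w",
           encoding={"t2m": {"chunks": (1, 181, 360)}})

# reading back: only the chunks overlapping the selection are read from disk
sub = xarray.open_zarr("toy_t2m.zarr", chunks=None)["t2m"]
print(sub.isel(time=0).sel(latitude=slice(90, 55)).values.shape)

The key point is that the selection drives which chunk files get read, not the size of the whole store.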

Compare this to a regular call to the Copernicus Climate Data Store (CDS), for, say, ERA5. Doing this in a Jupyter notebook using Python typically involves these steps:

import cdsapi

c = cdsapi.Client()
year = "2024"
month = "03"
days = ["{:02d}".format(d) for d in range(1, 32)]  # all days of the month
# selecting an area around Greenland here (North, West, South, East)
area_for_era5 = [90, -110, 55, 40]

c.retrieve(
    'reanalysis-era5-single-levels',
    {
        'product_type': 'reanalysis',
        'variable': '2m_temperature',
        'year': year,
        'month': month,
        'day': days,
        'time': [
            '00:00', '03:00', '06:00',
            '09:00', '12:00', '15:00',
            '18:00', '21:00',
        ],
        'area': area_for_era5,
        'format': 'netcdf',
    },
    'era5_t2m_greenland_' + year + month + '.grb')

Running this as a traditional CDS request takes about a minute (times may vary depending on how busy the service is).

2024-07-10 19:23:09,611 INFO Welcome to the CDS
2024-07-10 19:23:09,612 INFO Sending request to https://cds.climate.copernicus.eu/api/v2/resources/reanalysis-era5-single-levels
2024-07-10 19:23:09,859 INFO Request is queued
2024-07-10 19:23:12,587 INFO Request is running
2024-07-10 19:23:42,587 INFO Request is completed
2024-07-10 19:23:42,588 INFO Downloading https://download-0000-clone.copernicus-climate.eu/cache-compute-0000/cache/data5/adaptor.mars.internal-1720632210.5132751-8758-14-978349df-f90b-4c30-9a51-394d0a9524a2.nc to era5_t2m_greenland_202403.grb (38.8M)
2024-07-10 19:24:00,289 INFO Download rate 2.2M/s                                                                                            

Doing it with the cloud-optimized zarr dataset of ERA5 hosted by Google, you can pull out the whole world in one go (note that you need to install the gcsfs library first).

import xarray
ds = xarray.open_zarr('gs://gcp-public-data-arco-era5/ar/model-level-1h-0p25deg.zarr-v1',
                      chunks=None, storage_options=dict(token="anon"))
ds['temperature']

This gives me the temperature for the whole world in a matter of seconds.

<xarray.DataArray 'temperature' (time: 1089144, hybrid: 137, latitude: 721,
                                 longitude: 1440)>
[154918622718720 values with dtype=float32]
Coordinates:
  * hybrid     (hybrid) float32 1.0 2.0 3.0 4.0 5.0 ... 134.0 135.0 136.0 137.0
  * latitude   (latitude) float32 90.0 89.75 89.5 89.25 ... -89.5 -89.75 -90.0
  * longitude  (longitude) float32 0.0 0.25 0.5 0.75 ... 359.0 359.2 359.5 359.8
  * time       (time) datetime64[ns] 1900-01-01 ... 2024-03-31T23:00:00
Attributes: (12/22)
    GRIB_J:                          639
    GRIB_K:                          639
    GRIB_M:                          639
    GRIB_NV:                         276
    GRIB_cfName:                     air_temperature
    GRIB_cfVarName:                  t
    ...                              ...
    GRIB_stepUnits:                  1
    GRIB_typeOfLevel:                hybrid
    GRIB_units:                      K
    long_name:                       Temperature
    standard_name:                   air_temperature
    units:                           K
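The real payoff is that you can then slice out just the piece you need, and only those chunks are transferred. As a rough sketch (the date, model level, and longitude box below are illustrative assumptions, not values from the original run):

# pull a single hour over a box roughly covering Greenland, on the
# lowest model level (hybrid = 137); longitudes in this dataset run 0-360
t_sub = ds['temperature'].sel(
    time='2024-03-01T12:00',
    hybrid=137,
    latitude=slice(90, 55),      # latitude is stored from north to south
    longitude=slice(300, 340),   # roughly 60W to 20W
).load()
print(t_sub.shape)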

I have been testing the same approach with a new product recently released by dynamical, a company that aims to provide open-source code and open data repositories in the cloud. Pretty cool stuff! For someone who has been dealing with massive amounts of GRIB and NetCDF files, extracting data by hand using eccodes or xarray, this is a huge improvement in data access. And you don't need to store the data locally to begin with!