weatherlinguist

duckdb, ibis and weather data

As a data scientist, I do most of my daily coding in Python, and I regularly try to keep up to date with new data science libraries, especially those that can handle geospatial data.

I have been hearing a lot about duckdb lately, and I recently watched a very interesting talk about duckdb from the last GeoPython conference, given by one of its developers. This ties in neatly with another library I heard about recently and am now also using: ibis, described as "the portable dataframe library". It definitely adds multiple functionalities to your dataframe pipelines: it can handle anything from sqlite to duckdb, polars and pyspark.

The most common data format for observational data in meteorology is the BUFR format; it is the observational equivalent of the GRIB format used for weather models. One of my colleagues processes this format and others like ODB, and dumps myriads of ASCII files that contain observations for wind speed, temperature and other typical variables. I then have to convert these to other database formats like sqlite, to be ingested by model verification software like harp or MET. I have been using parquet files to store all those little files in one go, but it looks like duckdb would be a more efficient option.
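As a rough sketch of that conversion step, using only the standard library (the file layout and column names here are invented, not my colleague's actual output), many small ASCII dumps can be collected into a single sqlite table like this:

```python
import csv
import io
import sqlite3

# Pretend ASCII dumps, one per station (hypothetical layout)
ascii_dumps = {
    "station_a.txt": "date,wind_speed,temperature\n2024-05-20,3.2,281.4\n",
    "station_b.txt": "date,wind_speed,temperature\n2024-05-20,5.1,279.9\n",
}

con = sqlite3.connect(":memory:")  # in practice this would be a file on disk
con.execute(
    "CREATE TABLE obs (station TEXT, date TEXT, wind_speed REAL, temperature REAL)"
)
for name, content in ascii_dumps.items():
    station = name.removesuffix(".txt")
    for row in csv.DictReader(io.StringIO(content)):
        con.execute(
            "INSERT INTO obs VALUES (?, ?, ?, ?)",
            (station, row["date"], row["wind_speed"], row["temperature"]),
        )
con.commit()
print(con.execute("SELECT COUNT(*) FROM obs").fetchone()[0])  # → 2
```

With duckdb the same idea is even shorter, since it can read whole globs of CSV or parquet files directly into one table.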

One of the first interesting applications of duckdb I've seen is querying a database with a mind-blowingly long SQL command (source code provided by Benjamin Gutzmann) that produces a plot of air temperature over Germany for the whole period of observations available in the wetterdienst library. The original post from earthobservations on LinkedIn is here.

WITH
vls_raw AS (
    SELECT
        station_id,
        date,
        YEAR(date) AS yr,
        value
    FROM values
    WHERE parameter = 'temperature_air_mean_200'
),
vls_per_year AS (
    SELECT
        station_id,
        yr
    FROM vls_raw
    GROUP BY station_id, yr
    HAVING COUNT_IF(value IS NOT NULL) >= 360
),
vls_filtered AS (
    SELECT
        station_id,
        yr,
        value AS temp
    FROM vls_raw
    JOIN vls_per_year USING (station_id, yr)
),
vls_agg AS (
    SELECT
        station_id,
        yr,
        AVG(temp) AS avg_temp,
        COUNT_IF(temp IS NOT NULL) AS n
    FROM vls_filtered
    GROUP BY station_id, yr
),
stations_per_year AS (
    SELECT
        yr,
        COUNT(station_id) AS n_stations
    FROM vls_agg
    GROUP BY yr
),
vls_agg_filtered AS (
    SELECT vls_agg.*
    FROM vls_agg
    JOIN stations_per_year USING (yr)
    WHERE n_stations >= 10
      AND yr BETWEEN 1881 AND 2021
),
lm AS (
    SELECT
        regr_intercept(avg_temp, yr) AS intercept,
        regr_slope(avg_temp, yr) AS slope
    FROM vls_agg_filtered
),
stats AS (
    SELECT
        yr,
        QUANTILE_CONT(avg_temp, 0.05) AS q5,
        MEDIAN(avg_temp) AS median,
        QUANTILE_CONT(avg_temp, 0.95) AS q95,
        COUNT(avg_temp) AS n_stations,
        CAST(SUM(n) AS INT64) AS n_values
    FROM vls_agg_filtered
    GROUP BY yr
    ORDER BY yr
)
SELECT
    stats.*,
    lm.intercept + lm.slope * stats.yr AS trend
FROM stats
JOIN lm ON TRUE

Attempting to test this on my own, I used the following code to pull out the data:

import urllib.request

url = "https://github.com/earthobservations/wetterdienst/releases/download/dwd_obs_daily_climate_summary_2024_05_20/dwd_obs_daily_climate_summary.duckdb"
filename, headers = urllib.request.urlretrieve(url, filename="dwd_obs_daily_climate_summary.duckdb")
print("download complete!")

and then the following code to generate a similar plot:

import duckdb
import matplotlib.pyplot as plt
import time

filename = "dwd_obs_daily_climate_summary.duckdb"

sql_query = """
WITH
vls_raw AS (
    SELECT
        station_id,
        date,
        YEAR(date) AS yr,
        value
    FROM values
    WHERE parameter = 'temperature_air_mean_200'
),
vls_per_year AS (
    SELECT
        station_id,
        yr
    FROM vls_raw
    GROUP BY station_id, yr
    HAVING COUNT_IF(value IS NOT NULL) >= 360
),
vls_filtered AS (
    SELECT
        station_id,
        yr,
        value AS temp
    FROM vls_raw
    JOIN vls_per_year USING (station_id, yr)
),
vls_agg AS (
    SELECT
        station_id,
        yr,
        AVG(temp) AS avg_temp,
        COUNT_IF(temp IS NOT NULL) AS n
    FROM vls_filtered
    GROUP BY station_id, yr
),
stations_per_year AS (
    SELECT
        yr,
        COUNT(station_id) AS n_stations
    FROM vls_agg
    GROUP BY yr
),
vls_agg_filtered AS (
    SELECT vls_agg.*
    FROM vls_agg
    JOIN stations_per_year USING (yr)
    WHERE n_stations >= 10
      AND yr BETWEEN 1881 AND 2021
),
lm AS (
    SELECT
        regr_intercept(avg_temp, yr) AS intercept,
        regr_slope(avg_temp, yr) AS slope
    FROM vls_agg_filtered
),
stats AS (
    SELECT
        yr,
        QUANTILE_CONT(avg_temp, 0.05) AS q5,
        MEDIAN(avg_temp) AS median,
        QUANTILE_CONT(avg_temp, 0.95) AS q95,
        COUNT(avg_temp) AS n_stations,
        CAST(SUM(n) AS INT64) AS n_values
    FROM vls_agg_filtered
    GROUP BY yr
    ORDER BY yr
)
SELECT
    stats.*,
    lm.intercept + lm.slope * stats.yr AS trend
FROM stats
JOIN lm ON TRUE
"""
con = duckdb.connect(filename)
start_time = time.time()
result = con.execute(sql_query)
query_time = time.time() - start_time
print(f"Total execution time for the long query: {query_time:.4f} seconds")
# Convert the result to a pandas DataFrame
df = result.df()

# Plot the shaded area for 5-95 percentile
plt.fill_between(df['yr'], df['q5'], df['q95'], color='gray', alpha=0.3)

# Plot the median temperature
plt.plot(df['yr'], df['median'], color='blue', linewidth=1, label='median')

# Plot the trend line
plt.plot(df['yr'], df["trend"], color='red', linewidth=2, label='trend')

# Customize the plot
plt.title('Temperature trend (K) 1881 to 2021', fontsize=14)
plt.xlabel('Year', fontsize=12)
plt.ylabel('Temperature (K)', fontsize=12)
plt.grid(True, linestyle='--', alpha=0.7)
plt.legend()

# Set y-axis limits
plt.ylim(276, 286)

# Add source text
plt.text(0.02, 0.02, 'Source: Deutscher Wetterdienst', transform=plt.gca().transAxes,
         fontsize=8, verticalalignment='bottom')

plt.tight_layout()
plt.savefig('dwd_plot.png')
con.close()

This produces the image below (saved as dwd_plot.png).

The whole SQL query took between 1.5 and 3 seconds, depending on the attempt. That is impressive performance: when I tried loading the whole 1.2 GB file into a pandas dataframe instead, my laptop froze.
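The difference comes down to where the work happens: the database engine aggregates in place and only the small result set crosses into Python, instead of every raw row being materialised in a dataframe. The same pattern can be sketched with the standard library's sqlite3 (duckdb's Python API follows the same connect/execute shape, and its columnar engine makes this far faster still):

```python
import sqlite3

# Build a toy observations table: 100,000 rows across 100 years
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE obs (yr INTEGER, temp REAL)")
con.executemany(
    "INSERT INTO obs VALUES (?, ?)",
    [(1900 + i % 100, 280.0 + (i % 7)) for i in range(100_000)],
)

# The engine does the aggregation; only ~100 summary rows reach Python,
# not 100,000 raw observations
rows = con.execute(
    "SELECT yr, AVG(temp) FROM obs GROUP BY yr ORDER BY yr"
).fetchall()
print(len(rows))  # → 100
```

Pandas, by contrast, has to hold every row in memory before it can aggregate anything, which is exactly what choked my laptop on the 1.2 GB file.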