weatherlinguist

The verification (validation) of weather forecasts

An important part of my work as a numerical modeller is to check that the simulations I run produce something sensible. This usually involves comparing the output of the numerical solver against some validation data set or benchmark. I will be focusing on physical numerical weather prediction (NWP) models, although data-driven models are becoming increasingly popular these days.

There are two aspects which are important when determining if the computer code you wrote is doing something sensible. The first one is to check that the code correctly solves the equations of the physical model you are using. The second one is to compare your results with experiments, i.e. with reality. In the jargon of computational fluid dynamics (CFD), the first part is called verification (are we solving the equations right?) and the second part validation (are we solving the right equations?), as described in any of Oberkampf's seminal papers on this topic.

Although the term validation is used at times, in numerical weather prediction and atmospheric sciences most people use the term verification with the meaning of validation. In NWP the numerics (i.e., the verification part in CFD jargon) of every operational weather model have been thoroughly checked before the model is run on a regular basis (and weather models have been developed over many decades of research). At this point in time I think we are pretty sure of what the correct equations for modelling the weather are!

The simplest form of model verification is point verification. It usually involves taking a time series of observations at a set of fixed locations (usually synoptic weather stations) and comparing it with the weather model output at those same locations. One of the most comprehensive descriptions of the different scores one can calculate can be found in Beth Ebert's verification website at the Australian Bureau of Meteorology. This website has a long list of the most frequent verification scores you will encounter in meteorology. I usually look at things like bias and standard deviation as a function of forecast hour to get a general feeling for how bad the weather forecast becomes with time; I want these scores to be as small as possible.
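As a minimal sketch of what this kind of point verification looks like in practice (the column names and numbers below are made up for illustration, not from any real data set), the bias and error standard deviation per forecast lead time can be computed from a table of matched forecast-observation pairs:

```python
import pandas as pd

# Hypothetical paired data: each row holds a forecast and the matching station
# observation, labelled with the forecast lead time in hours.
df = pd.DataFrame({
    "lead_time": [3, 3, 6, 6, 12, 12],
    "forecast":  [12.1, 10.4, 13.0, 9.8, 14.2, 8.9],   # e.g. 2 m temperature (degC)
    "observed":  [11.5, 10.9, 12.1, 10.5, 12.8, 10.1],
})

# Forecast error at each station and time.
df["error"] = df["forecast"] - df["observed"]

# Bias (mean error) and standard deviation of the error, per lead time.
scores = df.groupby("lead_time")["error"].agg(bias="mean", std="std")
print(scores)
```

In a real setting the table would contain thousands of station-forecast pairs per lead time, but the aggregation step is essentially the same.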

Figure: example from Tambke et al. of bias and standard deviation.

Figure: example of operational verification for some of the Canadian models, compared against a number of global models (source: Canadian met service).

As an end-user, of course, it is irrelevant whether the scores say that the model is good. As a normal person, you want to know if it is going to rain today at your place, or if it is going to be windy, too hot, etc. One of the difficulties of modelling the weather with computers is that the atmosphere is a chaotic dynamical system (see this nice explanation from ECMWF), and small errors in the initial conditions can grow very fast, limiting the skill of numerical weather models to forecast the future.
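This sensitivity to initial conditions is easy to demonstrate with a toy chaotic system. The sketch below is not an NWP model, just the classic Lorenz-63 equations: two trajectories whose initial states differ by one part in a million end up completely different after a while.

```python
import numpy as np

def lorenz63(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Right-hand side of the Lorenz-63 system, a classic toy model of chaos."""
    x, y, z = state
    return np.array([sigma * (y - x), x * (rho - z) - y, x * y - beta * z])

def integrate(state, dt=0.01, steps=2000):
    """Simple 4th-order Runge-Kutta time stepping."""
    traj = [state]
    for _ in range(steps):
        k1 = lorenz63(state)
        k2 = lorenz63(state + 0.5 * dt * k1)
        k3 = lorenz63(state + 0.5 * dt * k2)
        k4 = lorenz63(state + dt * k3)
        state = state + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0
        traj.append(state)
    return np.array(traj)

a = integrate(np.array([1.0, 1.0, 1.0]))
b = integrate(np.array([1.0, 1.0, 1.0 + 1e-6]))  # tiny perturbation of the initial state

# Distance between the two trajectories: starts near zero, then grows rapidly.
print(np.linalg.norm(a - b, axis=1)[::500])
```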

One way to improve weather predictions using a physical NWP model is to increase the spatial or temporal resolution of the model, or to add more observations in the data assimilation step. Counter-intuitively, using a higher resolution model can actually lead to worse scores (in the point-by-point verification sense described above). As described in this blog post, a forecast of rainfall might be correct in terms of intensity, size, and timing, but completely off on location, giving very bad statistical scores (many misses and false alarms, in a statistical sense). This is called the "double penalty problem" in weather forecast verification: a coarser resolution model can get better scores simply because its grid cells are larger, so a slightly displaced feature still overlaps the observation and the verification results look better. Recent initiatives like the Destination Earth project focus on running very high resolution (100s of meters) simulations on demand, triggered by extreme weather events that are harder to predict.
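One common way to soften the double penalty is to verify over neighbourhoods of grid points rather than point by point, for example with the Fractions Skill Score of Roberts and Lean (2008). Below is a rough, self-contained sketch of the idea; the toy rain fields and the naive box-average neighbourhood are my own simplifications for illustration, not any operational implementation.

```python
import numpy as np

def fractions_skill_score(forecast, observed, threshold, window):
    """Fractions Skill Score: compare the fraction of grid points exceeding a
    threshold within a neighbourhood, instead of matching rain point by point."""
    fc = (forecast >= threshold).astype(float)
    ob = (observed >= threshold).astype(float)

    def neighbourhood_fraction(field):
        # Mean of the binary field over a sliding window (simple box average).
        padded = np.pad(field, window // 2, mode="constant")
        out = np.empty_like(field)
        for i in range(field.shape[0]):
            for j in range(field.shape[1]):
                out[i, j] = padded[i:i + window, j:j + window].mean()
        return out

    pf, po = neighbourhood_fraction(fc), neighbourhood_fraction(ob)
    mse = np.mean((pf - po) ** 2)
    mse_ref = np.mean(pf ** 2) + np.mean(po ** 2)
    return 1.0 - mse / mse_ref if mse_ref > 0 else np.nan

# Toy example: a rain feature forecast a few grid cells away from where it was observed.
obs = np.zeros((20, 20)); obs[8:12, 8:12] = 5.0
fcst = np.zeros((20, 20)); fcst[10:14, 11:15] = 5.0
print(fractions_skill_score(fcst, obs, threshold=1.0, window=7))
```

Point-by-point scores would punish this displaced feature twice (a miss where it rained, a false alarm where it did not), while the neighbourhood fraction still gives the forecast some credit.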

For the person going out for a walk at noon, who sees the weather app saying "it will rain at 2", thinks it is safe to leave the umbrella at home, and gets wet instead, it is small consolation to think: "oh well, the model was wrong on the timing, but not the location!". The weather modeller sitting at his desk, on the other hand, will feel very satisfied with his prediction, since he sees a low score on the verification plot. So have a bit of understanding for the people working at your weather service: this is a pretty difficult problem to crack!