The VeCAP-Tool

Rita Glowienka-Hense$*$, Sophie Stolzenberger$\u2020$, Christopher Frank$\u2021$, Andreas Hense$?$ |

Meteorologisches Institut, Universität Bonn |

Version from March 16, 2015

This documentation gives a brief description of the VeCAP-Tool. VeCAP also provides a
stand-alone version of its evaluation package on the MiKlip server. The path to the directory
is /miklip/home/b/b305002/vecap. There you can find the latest VeCAP-version and the
corresponding documentation.

If there are any bugs, please contact us!

### 1 Overview

With this tool, it is possible to download the needed data and, subsequently, to calculate necessary tests for predictability such as ANOVA and Analysis rank histograms/Talagrand-diagrams/ and a number of scores/skill scores and indices (Fig. 1). In Section 2, it is described how the hindcast and historical experiments can be obtained from the DKRZ archive. It is explained how certain lead years or running averages can be selected. The statistical methods are explained in Section 3. Section 4 shows the technical details.

### 2 Model and observational data

First, the original model data and verifying observations are obtained from the DKRZ archive. For this step it is important to select the desired baseline experiments (b0, b1, hist), model resolution (LR, MR) and subsystem (atmosphere or ocean). It is possible to intend the analyses for horizontal fields at different vertical levels and zonally averaged fields. The seleceted variables can be also detrended.

Furthermore, single lead years and multi-year averages can be selected. The scripts are intended to form time series from the model data which allow to compare model prediction and analysis data for a series of prediction ensembles. Variables from the same lead year of consecutively started predictions are merged to form a time series that can be compared with observation. For each start year an ensemble of predictions is provided: 3 runs for b0 experiments and 10 runs for b1 experiments. In the analyses the predictions called run 1 are combined to form the run 1 ensemble member of the combined predictions. The combination of different runs from different base year ensembles would be as meaningful because each ensemble member of each base year is disturbed independently. Fig. 2 shall explain the data formation.

The data are prepared for our statistical analyses and stored locally (output directory).

### 3 Statistical Methods

#### 3.1 Analysis of variance (anova/bianova)

The analysis of variance measures the variance due treatments relative to the total variance. This factor is also called coefficient of multiple determination. An unbiased estimator is used following Storch and Zwiers (1999). The significance of this parameter is estimated under the assumption of the F-distribution. In the tool the lead years are considered as treatments. The data show potential predictability if the variability between the ensemble means of the predicted lead years makes up a major part of the total variability. For the bianova the runs are treated as a second treatment. Significant results would indicate marked differences between the models used for the different runs.

#### 3.2 Root mean square error (rms)

Under the title of rms the following indicators are calculated whereby the rms is separated into two parts. The following statistics are calculated:

- Mean observations
- Standard deviation observations
- Mean model data
- Standard deviation model data
- Mean difference $diff$: model minus observation(Bias would be squared difference but with loss of information). Confidence test: Student-t-test.
- Quotient ${\sigma}_{obs}\u2215{\sigma}_{model}$(=ensemble mean) of standard deviation.
- Correlation $corr$
between model and observational time series. Confidence test: Student-t-test.
- Correlation shows the optimal predictability of observations by the model data.
- It can be achieved by using a regression model.
- Here $diff$ is the mean correction term and ${\sigma}_{obs}\u2215{\sigma}_{model}*corr$ the regression coefficient.

- A Mean Square Error skill score ($msess$)
following Murphy and Epstein (1988) has been calculated.
- In their 1989 article the authors do not yet profit from ensemble calculations.
- Their equation 1-4 is used, here averaging is however over ensembles and not over grid points.
- The score compares the mean square difference between predicted values and observation
- with the mean square difference between all model values and observation:
- $msess=1$ for perfect prediction.
- $0<msess<=1$ if prediction errors are smaller than the error between model climate and abservations
- $msess<0$ errors are larger if model prediction is used instead of climate
- In the following $f,c$ denote forcasted and climate values respectively.
- $msess:=1-\frac{\stackrel{?}{mse\left(f\right)}}{\stackrel{?}{mse\left(c\right)}}$
- $mse\left(f,t\right):=\frac{1}{nrun}{\sum}_{irun=1}^{nrun}{\left(f\left(irun,t\right)-obs\left(t\right)\right)}^{2}$
- $mse\left(c,t\right):=\frac{1}{nrun}{\sum}_{irun=1}^{nrun}\left(\frac{1}{nt}{\sum}_{it=1}^{nt}{\left(c\left(it,irun\right)-obs\left(t\right)\right)}^{2}\right)$
- $\stackrel{?}{mse\left(\cdot \right)}:=\frac{1}{nt}{\sum}_{it=1}^{nt}mse\left(\cdot ,it\right)$

#### 3.3 Sea level pressure indices (index)

This section titled index is a collection of slp indices calculated from model and observational data. Similar to the above analysis using grid points anova and rms statistics are performed. The results are shown as tables. The indices are picked out following a list of Kaplan (2011). The intention is to analyse physical processes with respect to their potential predictability. The following indices area calculated.

- Southern oscillation indices: mean square difference between predicted values and
observation
- Standardized slp difference Tahiti-Darwin
- Standardized slp Darwin
- Equatorial SOI

- NAO indices:
- Standardized slp differences Lisbon-Stykkisholmur
- Gibraltar-Reykjavikmean square difference between predicted values and observation
- A pc-based index over the Atlantic sector (20${}^{\circ}$N-80${}^{\circ}$N,
90${}^{\circ}$W-40${}^{\circ}$E)

Deviating from Hurrels (1995) suggestion, the pc are calculated from the correlation matrix and all months of the year are stringed togethter. This reduces the noise for small data series and localizes the pattern.

- North Pacific Index NPI
- SLP ($3{0}^{\circ}-6{5}^{\circ}$N, $16{0}^{\circ}$E-$14{0}^{\circ}$W)

- Antarctic Oscillation (AAO) indices
- zonal mean SLP difference $4{0}^{\circ}$S minus $6{5}^{\circ}$S
- zonal mean SLP difference $4{0}^{\circ}$S minus $7{0}^{\circ}$S

#### 3.4 Analysis rank histograms or Talagrand-diagrams (talagrand)

In addition to the ANOVA, further necessary tests are the Talagrand-diagrams (also known as analysis rank histograms). This is a graphical method to evaluate if the spread of the ensemble is too small (U-shape of the histogram), too large (inverse U-shape) or just perfect (flat histogram). The PDF of a $\beta $-distribution can be fitted to the histograms and out of the required parameters for this distribution a $\beta $-score can be calculated (Keller and Hense, 2011).

#### 3.5 $\beta $-score (betascore)

The $\beta $-scores (see section 3.4) are calculated at every gridpoint and are displayed as cards and as a function of latitude. In this case the 5%- and the 95%-quantile is calculated with the bootstrap method.

#### 3.6 Continuous ranked probability skill score (crpss)

The CRPS is defined as a certain distance between the predictive CDF and the CDF of the observation, the so-called Heaviside function. Thus the CRPS is the integral of the Brier Score at all possible threshold values (Hersbach, 2000). We use the analytical solution of the integral for standard Gaussian after Gneiting et al. (2005). The CRPS is illustrated at every gridpoint. We get a skill score when the CRPS is related to the CRPS climate. These skill scores are illustrated at every gridpoint and in dependence on the latitude. Additonally, the robustness of the CRPSS is calculated by bootstrapping (sample size is 500). The p-value for CRPSS of zero is estimated. If this p-value is smaller than 0.05 (larger than 0.95), we call the CRPSS to be robust in terms of its positive (negative) skill.

#### 3.7 Ensemble spread score (ess)

The ESS is the ratio of the ensemble spread and the ensemble forecast error (Keller and Hense, 2011) and is calculated at every gridpoint and as a function of latitude.

#### 3.8 Principal Components with multivariate Analysis (eof)

Using the input data lona,lone,lata,late a sector of the input data is selected. The model data for the different runs and the observational data are stringed to one data set. Internally the time axis is reset. The bias between model and observational data is corrected. The time series are normed by their common standard deviation. Then the eigenmodes from the correlation matrix are calculated for each month using cdo. The eigenmodes are stored in a scratch directory. In a further step a time dependent $msess\left(t\right)$ is calculated from the mode amplitudes. The approach is the same as for the grid point analysis which is described in the rms module except for the time averaging. Again positive values show that the model predictions are closer to the respective observation (analysis) than the model climate. This is especially helpful for observations which are equal to the climate mean. In this case, the ensemble mean can be correct either because the model predicts this value or otherwise because the model is no longer forced by external parameters or initial conditions and thus the average over a large ensemble is just the climate mean. In this case, however, the spread of the ensemble should be larger than for a real prediction. Tables show the squared correlation between model and observed eof amplitudes.

#### 3.9 Reliability classifications (reliab)

Reliability diagrams describe the relation between forecasted probabilities and observed relative frequencies (for further details see e.g. Br?cker (2009)). To calculate the observed relative frequencies, binary events out of the observations have to be estimated (default setting is the exceedance of the median). The slope $m$ of the reliability diagram is classified into 3 categories: the forecasts are reliable ($m=1$), potentially reliable ($0.5<m<1.5$) or unreliable ($m<0.5\vee m>1.5$). Additonally, the Brier skill score (reference forecast predictability is p_thr, default: p_thr=0.5) and the area under ROC curves is calculated.

### 4 Input

#### 4.1 How to get the data?

In Tab. 1 the different settings are shown to get the data and prepare them for further
analyses. The default settings are done for b1-LR with 10 ensemble runs. The selected
variable is the geopotential height (zg) at 850, 500 and 200 hPa for prediction year 2-5
(1982-2012). Z contains the trend. ERA-Interim reanalysis is used for verifying the
simulations.

The selection of other vertical levels is also possible. The unit is Pascal (Pa).

#### 4.2 How to set the parameters?

The data can be modified by choosing certain months and/or segments of the global maps (but they will not be saved as extra files). The number of gridpoints for the original fields is necessary. If there are missing values it is also required to fill them in. Tab. 2 summarises these features.

#### 4.3 How to start the calculations?

Tab.3 and 4 show the input section for the application of the statistical methods. As all calculations are turned off (default) the user has to activate the test which is of interest. For all tests, the bias correction is performed (default setting which can be turned off). It is possible to start several calculations.

#### Example: how to start?

In the first step, the evaluation system has to be loaded:

module load evaluation_system

In the second step, the VeCAP-Too can be started:

analyse --tool vecap run_anova=yes run_rms=yes

In this case, ANOVA and RMS-statistics will be calculated for zg in 850, 500 and 200 hPa, basis is b1-LR with 10 member etc. (see default settings).

cache | cache directory |

default: /scratch/user/evaluation_system/vecap | |

outdir_path | output directory |

default: /scratch/user/evaluation_system/output/vecap/ | |

b0orb1 | choose baseline experiment, options: ’b0’, ’b1’ |

default: b1 | |

yz | verical cross section or horizontal field; options: ’yes’, ’no’ |

default: no | |

hind_cast_or_hist | choose hindcasts or historical ensemble; options: ’hind_cast’, ’hist’ |

default: hind_cast | |

lmormr | choose model resolution; options: ’LR’, ’MR’ |

default: LR | |

sphere | choose options: ’atmos’, ’ocean’ |

default: atmos | |

ja | choose start year of analysis, e.g. boundary condition for time period of ERA-Interim (1979-2012): min(ja)=1979+nt-1 |

default: 1982 | |

je | choose end year of analysis, e.g. boundary condition for time period of ERA-Interim (1979-2012): max(je)=2012 |

default: 2012 | |

leadyear | choose first predicted year to be selected; options:’1’,...,’10’ |

default: 2 | |

nt | number of years to be averaged starting with selected year |

default: 4 | |

nrun | number of ensemble runs |

default: 10 | |

variable | choose variable |

default: zg | |

sellev | select vertical levels (’yes’) or surface (’no’); options: ’yes’, ’no’ |

default: yes | |

levels | choose levels (in Pa); example for vertical cross sections: 100000,92500,85000,70000,60000,50000,40000,30000, |

25000,20000,15000,10000,7000,5000,3000,2000,1000 | |

default: 20000,50000,85000 | |

detrend | choose ’yes’ or ’no’ |

default: no | |

observations | choose ’yes’ or ’no’ if observations are available |

default: yes | |

nx | number of gridpoints in x-direction; in case of yz=yes nx is forced to 1 |

default: 192 | |

ny | number of gridpoints in y-direction |

default: 96 | |

lona | western longitude for selecting a field |

default: 0 | |

lone | eastern longitude for selecting a field |

default: 360 | |

lata | southern latitude for selecting a field |

default: -90 | |

late | northern latitude for selecting a field |

default: 90 | |

xmiss | value for missing data to fill |

default: 1.e20 | |

mona | start month, e.g. 1 = January |

default: 1 | |

mone | end month, e.g. 12 = December; if mona = 13, the annual mean is also calculated |

default: 12 | |

nbmon | number of months per year which are in the model data files and in the observational data file |

default: 12 | |

xmin | minimum of values |

default: -1.e20 | |

xmax | maximum of values |

default: 1.e20 | |

run_anova | performs analysis of variance (ANOVA) |

default: no | |

run_yzanova | performs analysis of variance (ANOVA) for height-cross sections of zonally averaged variables (yz) |

default: no | |

run_bianova | performs bivariate analysis of variance |

default: no | |

run_rms | performs RMS-Statistics (mean square error skill score, correlations, ...) |

default: no | |

run_yzrms | performs RMS-Statistics (mean square error skill score, correlations, ...) for height-cross sections of |

zonally averaged variables (yz) | |

default: no | |

run_slp_index | performs slp index tests. It is only possible for the variable sea level pressure (psl). |

default: no | |

run_eof | performs EOF (empirical orthogonal function) analyses |

default: no | |

run_yz_eof | performs EOF (empirical orthogonal function) analyses for height-cross sections of zonally averaged variables (yz) |

default: no | |

run_talagrand | performs talagrand diagrams (i.e. analysis rank histograms) |

default: no | |

bias_corr_talagrand | If talagrand diagrams are perfomed, model data can be bias corrected beforehand. Choose ’yes’ or ’no’ |

default: yes | |

run_yztalagrand | performs talagrand diagrams (i.e. analysis rank histograms) for height-cross sections of zonally averaged variables (yz) |

default: no | |

bias_corr_yztalagrand | If talagrand diagrams for yz are perfomed, model data can be bias corrected beforehand. Choose ’yes’ or ’no’ |

default: yes | |

run_betascore | performs betascores as a function of latitude and for every gridpoint |

default: zg | |

bias_corr_bs | If betascores are perfomed, model data can be bias corrected beforehand. Choose ’yes’ or ’no’ |

default: yes | |

run_yzbetascore | performs betascores for height-cross sections of zonally averaged variables (yz)’) |

default: no | |

bias_corr_yzbs | If betascores (yz) are perfomed, model data can be bias corrected beforehand. Choose ’yes’ or ’no’ |

default: no | |

run_crpss | performs continuous ranked probability score (crps) for every gridpoint and its skill score (crpss) for every |

gridpoints and as a function of latitude | |

default: no | |

bias_corr_crps | If crpss is performed, model data can be bias corrected beforehand. Choose ’yes’ or ’no’ |

default: yes | |

run_yzcrpss | performs continuous ranked probability score (crps) for every gridpoint and its skill score (crpss) for every |

gridpoints for height-cross sections of zonally averaged variables (yz). | |

default: no | |

bias_corr_yzcrps | If crpss (yz) is performed, model data can be bias corrected beforehand. Choose ’yes’ or ’no’ |

default: yes | |

run_ess | performs ensemble spread score (ess) for every gridpoint and as a function of latitude |

default: no | |

bias_corr_ess | If ess is performed, model data can be bias corrected beforehand. Choose ’yes’ or ’no’ |

default: yes | |

run_yzess | performs ensemble spread score (ess) for every gridpoint for height-cross sections of zonally averaged variables (yz) |

default: no | |

bias_corr_yzess | If ess (yz) is performed, model data can be bias corrected beforehand. Choose ’yes’ or ’no’ |

default: yes | |

run_reliab | performs reliability classifications out of the reliability diagram, brier skill score and area under ROC curve for |

height-cross sections of zonally averaged variables - anaylses for every gridpoint | |

default: no | |

bias_corr_reliab | If reliab is performed, model data can be bias corrected beforehand. Choose ’yes’ or ’no’ |

default: yes | |

p_thr | If reliab is performed, a threshold value between 0 and 1 has to be defined to calculate forecast probabilities (out of |

the ensemble) and binary events (out of the observations) | |

default: 0.5 | |

run_yzreliab | performs reliability classifications out of the reliability diagram, brier skill score and area under ROC curve for |

height-cross sections of zonally averaged variables (yz) - anaylses for every gridpoint | |

default: no | |

bias_corr_yzreliab | If reliab (yz) is performed, model data can be bias corrected beforehand. Choose ’yes’ or ’no’ |

default: yes | |

yz_p_thr | If reliab (yz) is performed, a threshold value between 0 and 1 has to be defined to calculate forecast probabilities (out of |

the ensemble) and binary events (out of the observations) | |

default: 0.5 | |

### References

Bröcker, J.: Reliability, sufficiency, and the decomposition of proper scores,Quarterly Journal of the Royal Meteorological Society,135,643,1512–1519,2009.

Gneiting, T. and Raftery, A.E. and Westveld III, A.H. and Goldman, T.: Calibrated probabilistic forecasting using ensemble model output statistics and minimum CRPS estimation, Monthly Weather Review, 133,5,1098–1118,2005.

Hersbach, H.: Decomposition of the continuous ranked probability score for ensemble prediction systems,Weather and Forecasting,15,5,559–570,2000.

Keller, J.D. and Hense, A.: A new non-Gaussian evaluation method for ensemble forecasts based on analysis rank histograms,Meteorologische Zeitschrift,20,2,107–117,2011.

Murphy, A. and Epstein E.S.: Skill Scores and Correlation Coefficients in Model Verification,Monthly Weather Review,vol 117,p572-581.

Sardeshmukh P. D. and Hoskins B. J.: Spatial Smoothing on the Sphere, Monthly Weather Review 112, p2524-2529,1984.

Schulzweida, U., Kornblueh L. and Quast R.: CDO Climate Data Operators User’s Guide, Version 1.5.2, August 2011, MPI Hamburg

von Storch, H. and Zwiers, F.: Statistical Analysis in Climate Research,Cambridge University Press, 484pp,1999.