Assessing Accuracy and Ensemble Spread with MurCSS

 Henning W. Rust$\ast$, Madlen Fischer$†$, Christopher Kadow, Sebastian Illing, Oliver Kunst, Institut für Meteorologie, Freie Universität Berlin

Version from April 10, 2014

$\ast$hrust@met.fu-berlin.de
$†$madlen.fischer@met.fu-berlin.de

Please note that this documentation is currently under construction; in particular, Sections 3 and 4 have not been finished yet. Comments or any kind of feedback are highly appreciated. Please send an e-mail to the authors.

### 1 Introduction

This document briefly reviews and extends the ideas for verifying decadal predictions presented by Goddard et al. [2013] and describes how the plugin MurCSS of the MiKlip Central Evaluation System (CES) can be used to assess hindcast skill accordingly.

Following Goddard et al. [2013], two key questions can be identified which should guide the quality assessment of a decadal ensemble prediction system:

1. Is the initialized hindcast more accurate than the un-initialized projection?
2. Is the model's ensemble spread representative of the forecast uncertainty?

A way to tackle the first issue is to estimate the accuracy of the ensemble mean with the mean squared error skill score (MSESS) and its decomposition, covered in Sec. 2. The second question can be addressed using a particular variant of the continuous ranked probability skill score (CRPSS), see Sec. 3. The basic ideas underlying these two approaches stem from traditional forecast verification; for an overview see, e.g., Wilks [2011] and Jolliffe and Stephenson [2012]. The decomposition of the mean squared error skill score goes back to Murphy [1988]. Here, however, we use these methods for forecast verification to assess the quality of an ensemble prediction system by verifying a set of hindcasts against a reanalysis or another data set derived from observations.

In section 2, the assessment of accuracy of the ensemble mean with the mean square error skill score is described. Section 3 discusses the representation of forecast uncertainty with the ensemble spread and Sec. 4 explains input and output of the component MurCSS and gives an example analysis.

### 2 Accuracy with the mean square error skill score

#### 2.1 Mean square error and skill scores

The accuracy of the ensemble mean can be assessed with the mean squared error (MSE). To evaluate the added value of initialising the ensemble prediction system with an estimated state of the ocean (and the atmosphere), performance is compared to an uninitialised set of predictions; in the context of climate research, uninitialised predictions are called climate projections or climate simulations and are typically driven by scenarios for the radiative forcing (or emissions) [Kirtman et al.2013]. This setting calls for skill scores, which evaluate the performance of one prediction system against a reference prediction, typically a more or less trivial benchmark [e.g., Wilks2011]. In weather forecasting, this trivial benchmark is the climatological forecast, the persistence forecast or a random forecast. The appropriate benchmark for decadal prediction is the uninitialised prediction (climate projection). For hindcasts, the “climate projection” is called a historical run. A skill score evaluates the gain in performance over the trivial benchmark. The general expression for a skill score is

 ${SS}_{ref}=\frac{A-{A}_{ref}}{{A}_{perf}-{A}_{ref}}\cdot 100\%,$ (1)

and is typically given as the percent gain with respect to a reference forecast. Here, $A$ is the value of the accuracy measure (here the MSE, see below) for the prediction system in question, ${A}_{perf}$ is its value for a perfect prediction and ${A}_{ref}$ is its value for a reference forecast system. If the prediction system in question obtains the same accuracy as the perfect prediction, it is assigned 100% skill; a skill of 0% or less is interpreted as being equal to or worse than the (typically trivial) reference system.
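As a minimal illustration (not part of MurCSS), Eq. (1) can be written as a small function; the example values below are purely hypothetical:

```python
def skill_score(a, a_ref, a_perf=0.0):
    """Generic skill score, Eq. (1), in percent.

    a      -- accuracy measure of the prediction system (e.g. its MSE)
    a_ref  -- accuracy measure of the reference forecast
    a_perf -- accuracy measure of a perfect forecast (0 for the MSE)
    """
    return (a - a_ref) / (a_perf - a_ref) * 100.0

# A perfect system scores 100%, the reference itself 0%,
# and anything worse than the reference scores negative:
skill_score(0.0, 2.0)   # 100.0
skill_score(2.0, 2.0)   # 0.0
skill_score(3.0, 2.0)   # -50.0
```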

In the following, the mean square error (MSE) is considered as a more or less intuitive (or at least familiar) score to form the basis of the skill score. The MSE between a hindcast $H$ and observations (reanalysis) $O$ is defined as the mean of the squared deviations of the two at instances (times) $j=1,\dots ,n$ and reads, following the notation of Goddard et al. [2013],

 ${MSE}_{H}=\frac{1}{n}\sum _{j=1}^{n}{\left({H}_{j}-{O}_{j}\right)}^{2}.$ (2)

The minimum value of the MSE is 0, indicating a (hypothetical) perfect prediction system and leading to ${A}_{perf}=0$ in Eq. (1). An upper bound exists only if ${O}_{j}$ (and ${H}_{j}$) are bounded, which is not the case for many continuous variables; the MSE is then unbounded from above.

Using the MSE (Eq. (2)) as a skill score (Eq. (1)) with the climatological forecast $Ō$ as the reference leads to

 $MSESS\left(H,Ō,O\right)=\frac{MS{E}_{H}-MS{E}_{Ō}}{0-MS{E}_{Ō}}=1-\frac{MS{E}_{H}}{MS{E}_{Ō}},$ (3)

with the MSE of the climatological forecast

 ${MSE}_{Ō}=\frac{1}{n}\sum _{j=1}^{n}{\left(Ō-{O}_{j}\right)}^{2}.$ (4)

The MSESS represents the improvement in accuracy of the hindcasts over the climatological forecast derived from the observations. It is unbounded for negative values (as hindcasts can be arbitrarily bad) but bounded by 1 (if not expressed in percent) for positive values. It is a function of the hindcast $H$ (the ensemble mean in our case), the reference forecast, here the climatological forecast $Ō$, and the observations $O$ themselves, as they enter both $MS{E}_{H}$ and $MS{E}_{Ō}$. A positive value of the MSESS indicates a gain in hindcast performance compared to the reference forecast; zero or negative values indicate no gain.
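The MSESS against the climatological forecast (Eqs. (2)-(4)) can be sketched in a few lines of Python; this is an illustration only, not the MurCSS implementation, and the data are made up:

```python
import numpy as np

def msess_clim(h, o):
    """MSESS of a hindcast series h (e.g. the ensemble mean) against
    the climatological forecast, Eqs. (2)-(4)."""
    h, o = np.asarray(h, dtype=float), np.asarray(o, dtype=float)
    mse_h = np.mean((h - o) ** 2)            # Eq. (2)
    mse_clim = np.mean((o.mean() - o) ** 2)  # Eq. (4)
    return 1.0 - mse_h / mse_clim            # Eq. (3)

o = np.array([0.1, -0.4, 0.3, 0.8, -0.2])    # hypothetical observations
msess_clim(o, o)                     # 1.0, a perfect hindcast
msess_clim(np.full_like(o, o.mean()), o)   # 0.0, the climatological forecast itself
```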

It follows that the MSESS against a general reference forecast, e.g., the uninitialised historical runs $R$, reads

 $MSESS\left(H,R,O\right)=1-\frac{MS{E}_{H}}{MS{E}_{R}}.$ (5)

In the following, decompositions of the MSE (Eq. (2)) and its skill scores (Eqs. (3) and (5)) are discussed.

#### 2.2 Decomposing the mean square error

Rewriting the MSE (Eq. (2)) or the MSESS (Eqs. (3) and (5)) leads to elegant decompositions of the MSE and its skill scores into a few contributing terms with intuitive interpretations.

Consider first the decomposition of the MSE: inserting a zero by adding and subtracting the mean hindcast over all predictions issued (all times), $\bar{H}=\frac{1}{n}{\sum }_{j=1}^{n}{H}_{j}$, and the mean observation, $Ō=\frac{1}{n}{\sum }_{j=1}^{n}{O}_{j}$, on the right hand side (rhs) of Eq. (2) yields

 ${MSE}_{H}=\frac{1}{n}\sum _{j=1}^{n}{\left[\left({H}_{j}-\bar{H}\right)-\left({O}_{j}-Ō\right)+\left(\bar{H}-Ō\right)\right]}^{2}.$ (6)

Expanding Eq. (6) into single terms leads to

 $\begin{array}{rl}{MSE}_{H}=&\underset{={s}_{H}^{2}}{\underbrace{\frac{1}{n}\sum _{j=1}^{n}{\left({H}_{j}-\bar{H}\right)}^{2}}}-\underset{=2{s}_{HO}}{\underbrace{\frac{2}{n}\sum _{j=1}^{n}\left({H}_{j}-\bar{H}\right)\left({O}_{j}-Ō\right)}}+\underset{=0}{\underbrace{\frac{2\left(\bar{H}-Ō\right)}{n}\sum _{j=1}^{n}\left({H}_{j}-\bar{H}\right)}}\\ &\underset{=0}{\underbrace{-\frac{2\left(\bar{H}-Ō\right)}{n}\sum _{j=1}^{n}\left({O}_{j}-Ō\right)}}+\underset{={s}_{O}^{2}}{\underbrace{\frac{1}{n}\sum _{j=1}^{n}{\left({O}_{j}-Ō\right)}^{2}}}+{\left(\bar{H}-Ō\right)}^{2},\end{array}$ (7)

with the third and fourth terms being zero because the sums of the anomalies are zero. The first term represents the sample variance of the hindcasts ${s}_{H}^{2}$, the second term is twice the sample covariance of hindcasts and observations ${s}_{HO}$, the next-to-last term is the sample variance of the observations ${s}_{O}^{2}$ and the last term is the squared bias, $Bias=\bar{H}-Ō$. A shorthand notation of the above equation is

 ${MSE}_{H}={\left(\bar{H}-Ō\right)}^{2}+{s}_{H}^{2}+{s}_{O}^{2}-2{s}_{HO}.$ (8)

Using ${s}_{HO}={s}_{H}{s}_{O}{r}_{HO}$ and thus the sample correlation between the hindcasts and the observations ${r}_{HO}$, Eq. (8) yields

 ${MSE}_{H}={\left(\bar{H}-Ō\right)}^{2}+{s}_{H}^{2}+{s}_{O}^{2}-2{s}_{H}{s}_{O}{r}_{HO}.$ (9)

In the following, the MSE decomposition Eq. (9) is plugged into the skill score Eq. (1) to yield the decomposition of the MSESS.
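As a numerical sanity check with synthetic data (not MurCSS output), the decomposition Eq. (9) reproduces the directly computed MSE when all sample statistics use the $1/n$ convention:

```python
import numpy as np

rng = np.random.default_rng(0)
o = rng.normal(size=200)                          # synthetic "observations"
h = 0.8 * o + 0.3 + 0.5 * rng.normal(size=200)   # synthetic "hindcasts"

# Direct MSE, Eq. (2)
mse = np.mean((h - o) ** 2)

# Decomposition, Eq. (9): Bias^2 + s_H^2 + s_O^2 - 2 s_H s_O r_HO,
# with all sample statistics in the 1/n convention (ddof=0)
bias = h.mean() - o.mean()
s_h, s_o = h.std(), o.std()
r_ho = np.corrcoef(h, o)[0, 1]
mse_decomposed = bias**2 + s_h**2 + s_o**2 - 2.0 * s_h * s_o * r_ho

assert np.isclose(mse, mse_decomposed)
```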

#### 2.3 Decomposing the mean square error skill score

The decomposition Eq. (9) is inserted into the MSESS with the climatological forecast $Ō$ as reference (Eq. (3)). Using ${MSE}_{Ō}={s}_{O}^{2}$, i.e. the MSE of the climatological forecast being an estimate of the climatological variance, we obtain

 $MSESS\left(H,Ō,O\right)=-{\left[\frac{\bar{H}-Ō}{{s}_{O}}\right]}^{2}-{\left[\frac{{s}_{H}}{{s}_{O}}\right]}^{2}+2\frac{{s}_{H}}{{s}_{O}}{r}_{HO}.$ (10)

After adding and subtracting ${r}_{HO}^{2}$ and some rearrangement, the MSESS reads

 $MSESS\left(H,Ō,O\right)={r}_{HO}^{2}-{\left[{r}_{HO}-\frac{{s}_{H}}{{s}_{O}}\right]}^{2}-{\left[\frac{\bar{H}-Ō}{{s}_{O}}\right]}^{2}.$ (11)

The first term of Eq. (11) is the square of the sample correlation coefficient, a dimensionless measure of the linear relationship between the hindcasts and the observations. The second term is the squared conditional bias, a measure of reliability. The last term represents the (unconditional) bias; the latter can be removed by analysing anomalies. The MSESS with the bias removed thus reads

 $MSESS\left(H,Ō,O\right)={{r}_{HO}}^{2}-{\left[{r}_{HO}-\frac{{s}_{H}}{{s}_{O}}\right]}^{2},$ (12)

hence being basically decomposed into correlation and conditional bias. For the hindcast $H$ against the climatological forecast $Ō$, we denote the left-hand-side (lhs) of Eq. (12) as ${MSESS}_{H}=MSESS\left(H,Ō,O\right)$ and, analogously for another forecast $R$ against the climatological forecast $Ō$ as ${MSESS}_{R}=MSESS\left(R,Ō,O\right)$.

A similar decomposition can be obtained for a general reference from Eq. (5) by dividing numerator and denominator by ${s}_{O}^{2}$. After a few transformations, it reads

 $MSESS\left(H,R,O\right)=\frac{{{r}_{HO}}^{2}-{\left[{r}_{HO}-\frac{{s}_{H}}{{s}_{O}}\right]}^{2}-{{r}_{RO}}^{2}+{\left[{r}_{RO}-\frac{{s}_{R}}{{s}_{O}}\right]}^{2}}{1-{{r}_{RO}}^{2}+{\left[{r}_{RO}-\frac{{s}_{R}}{{s}_{O}}\right]}^{2}}.$ (13)

Thus, the MSESS of the hindcast against a reference forecast can be expressed in terms of the MSESS of the hindcast against the climatological forecast and the MSESS of the reference prediction against the climatological forecast:

 $MSESS\left(H,R,O\right)=\frac{MSES{S}_{H}-MSES{S}_{R}}{1-MSES{S}_{R}}.$ (14)
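The relation Eq. (14) can be verified numerically with synthetic series; this is an illustration under the $1/n$ convention for the sample statistics, not part of MurCSS:

```python
import numpy as np

def msess_parts(f, o):
    """MSESS vs. climatology via its decomposition, Eq. (11):
    r^2 - (r - s_f/s_o)^2 - ((fbar - obar)/s_o)^2."""
    s_f, s_o = f.std(), o.std()               # 1/n convention (ddof=0)
    r = np.corrcoef(f, o)[0, 1]
    bias = (f.mean() - o.mean()) / s_o
    return r**2 - (r - s_f / s_o) ** 2 - bias**2

rng = np.random.default_rng(1)
o = rng.normal(size=500)                      # synthetic observations
h = 0.9 * o + 0.4 * rng.normal(size=500)      # "initialised" hindcast
ref = 0.5 * o + 0.9 * rng.normal(size=500)    # "uninitialised" reference

msess_h = msess_parts(h, o)                   # MSESS_H, vs. climatology
msess_r = msess_parts(ref, o)                 # MSESS_R, vs. climatology

# Eq. (14): MSESS against the reference from the two climatology scores
msess_h_vs_r = (msess_h - msess_r) / (1.0 - msess_r)

# It must agree with the direct computation 1 - MSE_H / MSE_R, Eq. (5)
direct = 1.0 - np.mean((h - o) ** 2) / np.mean((ref - o) ** 2)
assert np.isclose(msess_h_vs_r, direct)
```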

In the following, the interpretation of the various terms of this decomposition are discussed.

#### 2.4 Interpretation of the MSESS and its decomposition

##### 2.4.1 MSESS

The MSESS is an integral measure of the improvement in accuracy of the hindcasts $H$ over the climatological forecast $Ō$ or another reference forecast $R$ with respect to the observations (reanalysis) $O$. The MSESS can take arbitrarily large negative values (as hindcast performance can be arbitrarily bad) but is bounded from above by 1. Positive values indicate a gain in accuracy of the hindcast compared to the reference; zero or negative values indicate no skill.

##### 2.4.2 Correlation

The correlation coefficient represents a dimensionless measure of the linear relationship between hindcast (or reference) and the observations, ranging from -1 to 1. In the case of the MSESS comparing the hindcast against the climatological forecast, the component MurCSS of the central evaluation system (CES) yields the correlation coefficient ${r}_{HO}$ between $H$ and $O$. In the case of an analysis of the hindcast against another reference prediction, the gain in correlation ${r}_{HO}-{r}_{RO}$ is shown.

##### 2.4.3 Conditional Bias

The conditional bias is a measure of the reliability of the hindcast/reference prediction, i.e. it gives information about the distribution of the observations given the hindcast, $p\left(o\mid h\right)$ [see, e.g., Wilks2011]. Its meaning can be illustrated with a linear regression of the observations on the hindcasts, see Fig. 1. The slope of this regression line can be expressed by the correlation coefficient and the standard deviations, $m={s}_{O}∕{s}_{H}\cdot {r}_{HO}$. Thus, the conditional bias gives an indication of the ratio of the variances of hindcast and observations. Figure 1 gives an example for a positive correlation coefficient.

For a positive correlation coefficient, a positive/negative conditional bias indicates a smaller/greater variance of the hindcast compared to the observations. For a negative correlation this relationship is reversed.

For the analysis of the hindcast $H$ against a reference prediction $R$, the component MurCSS of the central evaluation system (CES) gives the gain in conditional bias

 $\left|{r}_{HO}-\frac{{s}_{H}}{{s}_{O}}\right|-\left|{r}_{RO}-\frac{{s}_{R}}{{s}_{O}}\right|.$ (15)

### 3 Representing forecast uncertainty with the ensemble spread

#### 3.1 Continuous ranked probability score

In addition to the question whether the ensemble mean is an accurate prediction, it remains to clarify whether the ensemble spread is an appropriate representation of the forecast uncertainty. To this end, Goddard et al. [2013] suggested employing the continuous ranked probability skill score (CRPSS). The CRPSS is a skill score based on the continuous ranked probability score (CRPS), a quadratic measure in probability space [e.g., Wilks2011],

 $CRPS={\int }_{-\infty }^{\infty }{\left[F\left(y\right)-{F}_{0}\left(y\right)\right]}^{2}\,dy,$ (16)

with $F\left(y\right)$ being the cumulative distribution function (CDF) derived from the hindcast and, as the observations are not assigned a probability, ${F}_{0}\left(y\right)$ is the Heaviside (or step) function. The Heaviside function is defined as

 ${F}_{0}\left(y\right)=\Theta \left(y-{O}_{j}\right)=\left\{\begin{array}{ll}0, & y<{O}_{j}\\ 1, & y\ge {O}_{j}\end{array}\right.$ (17)

and basically jumps from 0 to 1 at the observation. CDFs of hypothetical hindcasts and the Heaviside function are schematically depicted in Fig. 2 (left). The other two panels of Fig. 2 illustrate how the CRPS works: the middle panel gives the difference between the forecast CDF $F\left(y\right)$ and the observation ${F}_{0}\left(y\right)$, and the right panel shows the squared difference.

For this example, we deliberately chose two forecast distributions with the same mean but different variance to illustrate that the CRPS is sensitive to the variance of the forecast probability distribution. For an ensemble prediction system, the forecast distribution is derived from a discrete (and often small) ensemble. This can be done, e.g., by deriving an empirical cumulative distribution function (ECDF) based on the ranks of the individual members, by assuming an underlying Gaussian distribution and estimating its mean and variance, or by various other methods, such as ensemble dressing [Broecker and Smith2008]. Goddard et al. [2013] suggested adopting the Gaussian hypothesis and using the parametric form of the CRPS derived by Gneiting and Raftery [2007]. For a hindcast ensemble with mean ${Ĥ}_{j}$ and variance ${\sigma }_{{Ĥ}_{j}}^{2}$ at hindcast dates $j$, it reads

 $CRPS\left({Ĥ}_{j},{\sigma }_{{Ĥ}_{j}}^{2},{O}_{j}\right)={\sigma }_{{Ĥ}_{j}}\left[\frac{{O}_{j}-{Ĥ}_{j}}{{\sigma }_{{Ĥ}_{j}}}\left(2\Phi \left(\frac{{O}_{j}-{Ĥ}_{j}}{{\sigma }_{{Ĥ}_{j}}}\right)-1\right)+2\varphi \left(\frac{{O}_{j}-{Ĥ}_{j}}{{\sigma }_{{Ĥ}_{j}}}\right)-\frac{1}{\sqrt{\pi }}\right],$ (18)

with $\varphi \left(.\right)$ and $\Phi \left(.\right)$ denoting the Gaussian probability density and distribution function, respectively. This setting allows assessing the ensemble spread for the hypothetical case of an unbiased forecast: the ensemble mean ${Ĥ}_{j}$ is set to the climatological mean derived from the observations, ${Ĥ}_{j}=Ō$.
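A sketch of the closed-form Gaussian CRPS (written in its non-negative form), together with a numerical cross-check against the defining integral Eq. (16); this uses only the standard library and NumPy and is not the MurCSS code:

```python
import numpy as np
from math import erf, sqrt, pi, exp

def crps_gaussian(mu, sigma, obs):
    """Closed-form CRPS of a Gaussian forecast N(mu, sigma^2) against a
    single observation (Gneiting and Raftery, 2007)."""
    z = (obs - mu) / sigma
    pdf = exp(-0.5 * z * z) / sqrt(2.0 * pi)   # standard normal density
    cdf = 0.5 * (1.0 + erf(z / sqrt(2.0)))     # standard normal CDF
    return sigma * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / sqrt(pi))

# Cross-check against the defining integral, Eq. (16), on a fine grid
mu, sigma, obs = 0.3, 1.2, 1.0                 # hypothetical values
y = np.linspace(-12.0, 12.0, 60001)
F = 0.5 * (1.0 + np.vectorize(erf)((y - mu) / (sigma * sqrt(2.0))))
F0 = (y >= obs).astype(float)                  # Heaviside at the observation
g = (F - F0) ** 2
numeric = float(np.sum(0.5 * (g[:-1] + g[1:])) * (y[1] - y[0]))  # trapezoid rule
assert abs(numeric - crps_gaussian(mu, sigma, obs)) < 1e-3
```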

#### 3.2 Continuous ranked probability skill score

As the goal is to assess whether the ensemble spread is a useful measure of the forecast uncertainty, the ensemble forecast is compared to a reference forecast with the desired spread. Goddard et al. [2013] suggested an unbiased forecast with spread equal to the MSE between the ensemble mean and the observations; the latter is a valid representation of the forecast uncertainty. Hence, the CRPSS for this particular case reads

 ${CRPSS}_{ES}=1-\frac{{CRPS}_{H}\left(Ō,{\sigma }^{2},O\right)}{{CRPS}_{MSE}\left(Ō,{MSE}_{H},O\right)},$ (19)

and it basically compares the ensemble spread ${\sigma }^{2}$ to ${MSE}_{H}$. Note that in the desired case ${\sigma }^{2}={MSE}_{H}$, the ratio on the rhs is 1 and ${CRPSS}_{ES}=0$. Thus, when the ensemble spread compares well to ${MSE}_{H}$ (our measure of forecast uncertainty), i.e. in the optimal case, the skill score used here attains 0, not 1 as one might typically expect. Figure 3 illustrates this behaviour with a simulation study using simple stochastic processes to emulate observations and hindcasts.
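The orientation of this score can be illustrated with a sketch on synthetic observations (function names are mine, not MurCSS's): when the prescribed ensemble spread equals $MSE_H$, numerator and denominator of Eq. (19) coincide and the score is exactly zero.

```python
import numpy as np
from math import erf, sqrt, pi, exp

def crps_gaussian(mu, sigma, obs):
    """Closed-form CRPS of a Gaussian forecast N(mu, sigma^2)."""
    z = (obs - mu) / sigma
    pdf = exp(-0.5 * z * z) / sqrt(2.0 * pi)
    cdf = 0.5 * (1.0 + erf(z / sqrt(2.0)))
    return sigma * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / sqrt(pi))

def crpss_es(spread2, mse_h, obs, obar):
    """Sketch of Eq. (19): unbiased Gaussian forecast centred on the
    climatological mean obar, with ensemble variance spread2 in the
    numerator and reference variance mse_h in the denominator."""
    num = sum(crps_gaussian(obar, sqrt(spread2), o) for o in obs)
    den = sum(crps_gaussian(obar, sqrt(mse_h), o) for o in obs)
    return 1.0 - num / den

obs = np.random.default_rng(2).normal(size=100)  # synthetic observations
mse_h = 0.8                                      # stand-in value for MSE_H
crpss_es(mse_h, mse_h, obs, obs.mean())   # 0.0: spread matches MSE_H exactly
crpss_es(0.2, mse_h, obs, obs.mean())     # nonzero: spread too small
```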

[ENSEMBLE SPREAD SCORE HAS BEEN RECENTLY IMPLEMENTED BUT IS NOT DESCRIBED HERE YET.]

### 4 Verification with MurCSS

#### 4.1 Input

The plugin MurCSS for the CES calculates the MSESS and its decomposition. This section describes the various options for the plugin. It is useful to search the data for the analysis beforehand with solr_search to narrow down the possible option values. Table 1 lists and describes all options.

| Option | Status | Description |
|---|---|---|
| Output |  | Output directory; default: /pf/b/user/evaluation_system/output/murcss |
| Output plots |  | Output directory of the produced plots; default: /pf/b/user/evaluation_system/plotst/murcss |
| Decadals | mandatory | The experiments to use, e.g. 1960,1965,1970,..,1995, or 1960:5:1995. To exclude a year (for example 1970) write 1960:5:1995-1970 |
| Variable | mandatory | Name of the variable to analyze (CMOR). The website provides a select list with a search function. |
| Project1 | mandatory | BASELINE1, BASELINE0 or CMIP5. A custom project can also be specified if the username is included in the evaluation system. |
| Product1 | mandatory | Product of Project1, part of the datapath |
| Institute1 | mandatory | Institute of Project1, part of the datapath |
| Model1 | mandatory | Model of Project1, part of the datapath |
| Experiment1 | mandatory | Experiment of Project1, e.g. ”decs4e” or ”decadal”, part of the datapath |
| Project2 | mandatory | Project compared to Project1: BASELINE1, BASELINE0 or CMIP5. A custom project can also be specified if the username is included in the evaluation system. |
| Product2 | mandatory | Product of Project2, part of the datapath |
| Institute2 | mandatory | Institute of Project2, part of the datapath |
| Model2 | mandatory | Model of Project2, part of the datapath |
| Experiment2 | mandatory | Experiment of Project2, e.g. ”decs4e” or ”decadal”, part of the datapath |
| Leadtimes | mandatory | Lead times to analyze, comma-separated list; time ranges can be hyphenated; default: 1,2-9 |
| Observation | mandatory | Observations to compare against, e.g. merra or hadcrut. A path to an observation file can also be specified. |
| Ensemblemembers | optional | Subset of the ensemble members to use; comma-separated list, e.g. r1i1p1,r2i1p1,r3i1p1 |
| Significance | optional | Whether to calculate significance levels (95%) using a bootstrap method. WARNING: this can take up to one day! |
| Bootstrap number | optional | Number of bootstrap runs, used if Significance is true; default: 500 |
| Level | optional | Height level for the analysis, for example 700 000 (Pa). At the moment only one level can be specified. Tip: find out common levels of observation and project beforehand. |
| Lonlatbox | optional | Region selection. For the region 100W-40E and 20N-80N write -100,40,20,80 |
| Maskmissingvalues | mandatory | Whether to mask grid points with missing values in the observation file; default: TRUE |
| Cache | mandatory | Path of the working directory; default: $USER_CACHE_DIR/$SYSTEM_TIMESTAMP |
| Result grid | optional | A grid file or a grid description like r20x25 |

Table 1: Options for MurCSS

#### 4.2 Output

Here, the output of the plugin is described, for comparing hindcasts against the climatological forecast (Sec. 4.2.1) and hindcasts versus another reference forecast (Sec. 4.2.2).

##### 4.2.1 Hindcast vs. climatological forecast

Table 2 lists the output obtained for the analysis of the hindcasts versus the climatological forecast.

| Output | Description |
|---|---|
| msess | Mean squared error skill score according to Eq. (12) |
| rmss | Root of the mean squared error skill score according to Eq. (12) |
| correlation | Correlation coefficient ${r}_{HO}$ as a measure of the linear relationship between the hindcast and the observations |
| conditional bias | Measure of the reliability, ${r}_{HO}-\frac{{s}_{H}}{{s}_{O}}$ |
| std_ratio | Ratio of the standard deviations of observations and hindcast, $\frac{{s}_{O}}{{s}_{H}}$ (part of the conditional bias) |
| biasslope | Slope of the linear regression between hindcast and observations, $m={s}_{O}∕{s}_{H}\cdot {r}_{HO}$; conditional bias = 0 if $m=1$ |
| biasslope-1 | Slope of the linear regression between hindcast and observations minus 1, $m-1={s}_{O}∕{s}_{H}\cdot {r}_{HO}-1$; conditional bias = 0 if $m-1=0$ |

Table 2: Output obtained for the comparison of hindcasts versus the climatological forecast.

##### 4.2.2 Hindcast vs. reference prediction

Table 3 lists the output obtained for the analysis of the hindcasts versus another reference forecast, e.g., the uninitialised historical runs.

| Output | Description |
|---|---|
| msess | Mean squared error skill score calculated with Eq. (13) |
| rmss | Root of the mean squared error skill score calculated with Eq. (13) |
| correlation | Gain in correlation of the hindcast with respect to the reference prediction, ${r}_{HO}-{r}_{RO}$ |
| conditional bias | Gain in conditional bias of the hindcast with respect to the reference prediction, $\vert {r}_{RO}-\frac{{s}_{R}}{{s}_{O}}\vert -\vert {r}_{HO}-\frac{{s}_{H}}{{s}_{O}}\vert$ |
| biasSkillScore | Improvement of the conditional bias in terms of the regression slope, $1-\frac{\vert {m}_{HO}-1\vert }{\vert {m}_{RO}-1\vert }$. Positive values indicate a smaller conditional bias than the reference prediction, negative values the opposite. |
| biasslope-1 | Gain in conditional bias represented by the regression slope minus 1, $\vert {m}_{HO}-1\vert -\vert {m}_{RO}-1\vert$ |

Table 3: Output obtained for the comparison of hindcasts versus another reference forecast.

#### 4.3 Example analysis

This section gives an example analysis with the plugin MurCSS. Suppose we want to estimate the skill of baseline1 against baseline0 for the near-surface air temperature tas, with the HadCrut data set as observations. First, solr_search identifies the respective data series. For baseline1 the command looks like

solr_search data_type=baseline1 variable=tas time_frequency=mon model=MPIESMLR

The output is a list of the paths of all available data files, for example:

/miklip/integration/data4miklip/model/global/miklip/baseline1/output/MPI-M/MPI-ESM-LR/decs4e1960/mon/atmos/tas/r1i1p1/tas_Amon_MPI-ESM-LR_decs4e1960_r1i1p1_196101-197012.nc

With the knowledge of the directory structure, the following settings are used for the analysis Baseline1 vs. Baseline0:

| Option | Value |
|---|---|
| Decadals | 1960:5:1995 |
| Variable | tas |
| Project1 | baseline1 |
| Product1 | output |
| Institute1 | mpi-m |
| Model1 | mpi-esm-lr |
| Experiment1 | decs4e |
| Project2 | baseline0 |
| Model2 | mpi-esm-lr |
| Experiment2 | decadal |
| Leadtimes | 2-5 |
| Observation | HadCrut |
| Significance | TRUE |

After the analysis is completed, the NetCDF files and the produced plots (Sec. 4.2) are available. Some of these plots are depicted in Fig. 4 for the analysis Baseline0/Baseline1 vs. Climatology.

In the analysis with the climatological forecast (the variance of the observations) as reference, the MSESS describes how well the hindcasts reproduce the observations. A positive value indicates a good reproduction, with a perfect hindcast at 1; negative values indicate the opposite. The near-surface air temperature is better represented by Baseline1 than by Baseline0, especially in the Indian Ocean and Southeast Asia. However, neither model reproduces the observations in the Pacific Ocean sufficiently.

The MSESS can be decomposed into the correlation and the conditional bias (see Sec. 2.3). High positive values of the correlation indicate good agreement between model and observations. High negative values suggest a strong linear relationship as well, but the opposite sign indicates poor agreement of model and observations. The improved MSESS of Baseline1 with respect to Baseline0 is mainly produced by a higher correlation coefficient at most grid points.

The observations are best represented with a conditional bias of 0. Figure 4 shows that Baseline1 has the lowest conditional bias, but there are still some areas, for example the Pacific Ocean and the northern Atlantic, with a pronounced conditional bias.

Another approach to understanding the conditional bias and its relationship to the standard deviations of hindcast and observations is the slope of the linear regression. In Fig. 4 the biasslope-1 is depicted as well. It is noticeable that the patterns are quite similar to those of the conditional bias. For a positive correlation coefficient, a positive slope and a positive conditional bias suggest a higher standard deviation of the observations with respect to that of the hindcast. If the correlation coefficient is negative, a positive slope and conditional bias indicate a higher standard deviation of the hindcast. For the near-surface air temperature, the correlation and the conditional bias are positive in Central Europe for Baseline1 and Baseline0, which suggests a higher standard deviation of the observations. The negative correlation and conditional bias over the Indian Ocean in Baseline0 indicate a higher standard deviation of the observations as well, but in Baseline1 the variation of the hindcast is higher due to the positive correlation and the negative conditional bias.

The ratio of the standard deviations ${s}_{O}∕{s}_{H}$ is also shown in Fig. 4.

In Fig. 5 the output of MurCSS is depicted for the analysis Baseline1 vs. Baseline0. A positive value of the MSESS suggests improved accuracy of the hindcast compared to the reference; a negative value indicates the opposite.

Figure 5 a) shows that Baseline1 is more accurate in most regions. Only in a few small areas, for example in South America or the Labrador Sea, is the agreement of Baseline0 with the observations higher. Figure 5 b) shows the gain of the correlation coefficient of Baseline1 compared to Baseline0. Positive values indicate an improved correlation of Baseline1. The analysis of the near-surface air temperature shows an improved correlation of Baseline1 with respect to Baseline0 for almost every grid point as well. In Fig. 5 c) the gain of the conditional bias is illustrated. Note that here negative values represent a reduction of the bias and thus improved accuracy. In the Labrador Sea, for example, the conditional bias is higher in Baseline1; the correlation is, however, slightly increased. Overall, the contribution of the conditional bias outweighs the effect of the correlation, so the MSESS shows decreased accuracy in the Labrador Sea.

An additional approach to illustrating the conditional bias is the gain of the biasslope-1, shown in Fig. 5 d). In this figure, negative values suggest a reduction of the conditional bias as well, so the pattern is quite similar to Fig. 5 c). Due to the calculation of the gain with absolute values ($|{m}_{HO}-1|-|{m}_{RO}-1|$), a statement about the gain in the standard deviations is no longer possible.

A third method to illustrate the conditional bias is the bias skill score $1-\frac{|{m}_{HO}-1|}{|{m}_{RO}-1|}$, Fig. 5 e). This time, positive values indicate a reduction of the conditional bias of Baseline1 compared to Baseline0 and negative values the opposite. In any case, Figs. 5 c)-e) illustrating the conditional bias show similar patterns.

### References

J. Broecker and L. A. Smith. From ensemble forecasts to predictive distribution functions. Tellus A, 60(4):663 ff., 2008.

T. Gneiting and A. E. Raftery. Strictly proper scoring rules, prediction, and estimation. J. Amer. Statist. Assoc., 102(477):359–378, 2007.

L. Goddard, A. Kumar, A. Solomon, D. Smith, G. Boer, P. Gonzalez, V. Kharin, W. Merryfield, C. Deser, S.J. Mason, B.P. Kirtman, R. Msadek, R. Sutton, E. Hawkins, T. Fricker, G. Hegerl, C.A.T. Ferro, D.B. Stephenson, G.A. Meehl, T. Stockdale, R. Burgman, A.M. Greene, Y. Kushnir, M. Newman, J. Carton, I. Fukumori, and T. Delworth. A verification framework for interannual-to-decadal predictions experiments. Climate Dynamics, 40:245–272, 2013. ISSN 0930-7575. doi: 10.1007/s00382-012-1481-2. URL http://dx.doi.org/10.1007/s00382-012-1481-2.

I. T. Jolliffe and D. B. Stephenson, editors. Forecast Verification. Wiley-Blackwell, second edition, 2012.

B. Kirtman, S. B. Power, J. A. Adedoyin, G. J. Boer, R. Bojariu, I. Camilloni, F. J. Doblas-Reyes, A. M. Fiore, M. Kimoto, G. A. Meehl, M. Prather, A. Sarr, C. Schär, R. Sutton, G.J. van Oldenborgh, G. Vecchi, and H. J. Wang. Near-term climate change: Projections and predictability. In T. F. Stocker, D. Qin, G.-K. Plattner, M. Tignor, S. K. Allen, J. Boschung, A. Nauels, Y. Xia, V. Bex, and P.M. Midgley, editors, Climate Change 2013: The Physical Science Basis. Contribution of Working Group I to the Fifth Assessment Report of the Intergovernmental Panel on Climate Change. Cambridge University Press, 2013.

A. H. Murphy. Skill scores based on the mean square error and their relationships to the correlation coefficient. Mon. Wea. Rev., 116(12):2417–2424, December 1988. ISSN 0027-0644. URL http://dx.doi.org/10.1175/1520-0493(1988)116<2417:SSBOTM>2.0.CO;2.

D. S. Wilks. Statistical methods in the atmospheric sciences. Academic Press, San Diego, CA, 3rd edition, 2011.