Assessing Accuracy and Ensemble Spread with MurCSS

Henning W. Rust, Madlen Fischer
Christopher Kadow, Sebastian Illing, Oliver Kunst
Institut für Meteorologie, Freie Universität Berlin

Version from April 10, 2014


hrust@met.fu-berlin.de
madlen.fischer@met.fu-berlin.de

Please note that this documentation is currently under construction; in particular, Sections 3 and 4 have not been finished yet. Comments or any other kind of feedback are highly appreciated. Please send an e-mail to the authors.

1 Introduction

This document briefly reviews and extends the ideas for verifying decadal predictions presented by Goddard et al. [2013] and describes how the plugin MurCSS of the MiKlip Central Evaluation System (CES) can be used to assess hindcast skill accordingly.

Following Goddard et al. [2013], two key questions can be identified which should guide the quality assessment of a decadal ensemble prediction system:

  1. Is the initialized hindcast more accurate than the un-initialized projection?
  2. Is the model's ensemble spread representative of the forecast uncertainty?

A way to tackle the first issue is to estimate the accuracy of the ensemble mean with the mean squared error skill score (MSESS) and its decomposition, covered in Sec. 2. The second question can be addressed using a particular variant of the continuous ranked probability skill score (CRPSS), see Sec. 3. The basic ideas underlying these two approaches stem from traditional forecast verification; for an overview see, e.g., Wilks [2011] and Jolliffe and Stephenson [2012]. The decomposition of the mean squared error skill score goes back to Murphy [1988]. Here, however, we use these methods for forecast verification to assess the quality of an ensemble prediction system by verifying a set of hindcasts against a reanalysis or another data set derived from observations.

Section 2 describes the assessment of the accuracy of the ensemble mean with the mean square error skill score. Section 3 discusses the representation of forecast uncertainty with the ensemble spread, and Sec. 4 explains input and output of the component MurCSS and gives an example analysis.

2 Accuracy with the mean square error skill score

2.1 Mean square error and skill scores

The accuracy of the ensemble mean can be assessed with the mean squared error (MSE). To evaluate the added value of initialising the ensemble prediction system with an estimated state of the ocean (and the atmosphere), performance is compared to an uninitialised set of predictions; in the context of climate research, uninitialised predictions are called climate projections or climate simulations and are typically driven by scenarios for the radiative forcing (or emissions) [Kirtman et al., 2013]. This setting calls for skill scores, which evaluate the performance of one prediction system against a reference prediction, typically a more or less trivial benchmark [e.g., Wilks, 2011]. In weather forecasting, this trivial benchmark is the climatological forecast, the persistence forecast or a random forecast. The appropriate benchmark for decadal prediction is the uninitialised prediction (climate projection). For hindcasts, the "climate projection" is called a historical run. A skill score evaluates the gain in performance against this benchmark. The general expression for a skill score is

$$\mathrm{SS}_{\mathrm{ref}} = \frac{A - A_{\mathrm{ref}}}{A_{\mathrm{perf}} - A_{\mathrm{ref}}} \cdot 100\%, \qquad (1)$$

and is typically given in percent gain with respect to a reference forecast. Here, A is the value of the accuracy measure (here the MSE, see below) for the prediction system in question, Aperf is its value for a perfect prediction and Aref is its value for a reference forecast system. If the prediction system in question obtains the same accuracy as the perfect prediction, it will be assigned 100% skill; a skill of 0% or less is interpreted as being equal to or worse than the (typically trivial) reference system.
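As an illustration, Eq. (1) can be written as a one-line function. This is a minimal sketch in Python, not taken from the MurCSS source; the names are illustrative:

# Minimal sketch of the generic skill score, Eq. (1).
def skill_score(a, a_ref, a_perf=0.0):
    """Return the skill in percent: 100 * (A - A_ref) / (A_perf - A_ref)."""
    return 100.0 * (a - a_ref) / (a_perf - a_ref)

# An accuracy halfway between the reference (A_ref = 4) and perfect (A_perf = 0):
print(skill_score(a=2.0, a_ref=4.0))  # -> 50.0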

In the following, the mean square error (MSE) is considered as a more or less intuitive (or at least familiar) measure to be the basis of the skill score. The MSE between a hindcast H and observations (reanalysis) O is defined as the mean of the squared deviations of the two at instances (times) $j = 1, \ldots, n$ and, following the notation of Goddard et al. [2013], reads

$$\mathrm{MSE}_H = \frac{1}{n}\sum_{j=1}^{n}\left(H_j - O_j\right)^2. \qquad (2)$$

The minimum value of the MSE is 0, indicating a (hypothetical) perfect prediction system and leading to $A_{\mathrm{perf}} = 0$ in Eq. (1). Whether the MSE is bounded from above depends on whether $O_j$ (and $H_j$) are bounded; for many continuous variables they are not, so the MSE has no upper bound.
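In Python, Eq. (2) is a direct transcription; the following sketch (illustrative, not the MurCSS implementation) is reused in later examples:

import numpy as np

def mse(h, o):
    """Mean squared error between hindcasts h and observations o, Eq. (2)."""
    h, o = np.asarray(h, dtype=float), np.asarray(o, dtype=float)
    return np.mean((h - o) ** 2)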

Using the MSE (Eq. (2)) as a skill score (Eq. (1)) with the climatological forecast Ō as the reference leads to

$$\mathrm{MSESS}(H,\bar{O},O) = \frac{\mathrm{MSE}_H - \mathrm{MSE}_{\bar{O}}}{0 - \mathrm{MSE}_{\bar{O}}} = 1 - \frac{\mathrm{MSE}_H}{\mathrm{MSE}_{\bar{O}}}, \qquad (3)$$

with the MSE of the climatological forecast

$$\mathrm{MSE}_{\bar{O}} = \frac{1}{n}\sum_{j=1}^{n}\left(\bar{O} - O_j\right)^2. \qquad (4)$$

The MSESS represents the improvement in accuracy of the hindcasts over the climatological forecast derived from the observations. It is unbounded for negative values (as hindcasts can be arbitrarily bad) but bounded by 1 (if not expressed as a percentage) for positive values. It is a function of the hindcast H (the ensemble mean in our case), the reference forecast, here the climatological forecast $\bar{O}$, and the observations O themselves, as they enter both $\mathrm{MSE}_H$ and $\mathrm{MSE}_{\bar{O}}$. A positive value of the MSESS indicates a gain in hindcast performance compared to the reference forecast; zero or negative values indicate no gain.
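Reusing mse() from the sketch above, the MSESS against the climatological forecast (Eqs. (3) and (4)) can be sketched as follows; again illustrative code, not the MurCSS implementation:

import numpy as np

def msess_climatology(h, o):
    """MSESS of hindcasts h against the climatological forecast, Eq. (3)."""
    o = np.asarray(o, dtype=float)
    mse_clim = np.mean((o.mean() - o) ** 2)  # Eq. (4); equals s_O^2 (ddof=0)
    return 1.0 - mse(h, o) / mse_clim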

It follows that the MSESS against a general reference forecast, e.g., the uninitialised historical runs R, reads

$$\mathrm{MSESS}(H,R,O) = 1 - \frac{\mathrm{MSE}_H}{\mathrm{MSE}_R}. \qquad (5)$$

In the following, decompositions of the MSE (Eq. (2)) and its skill scores (Eqs. (3) and (5)) are discussed.

2.2 Decomposing the mean square error

Rewriting the MSE (Eq. (2)) or the MSESS (Eqs. (3) and (5)) leads to elegant decompositions of the MSE and its skill scores into a few contributing terms with intuitive interpretations.

Consider first the decomposition of the MSE: inserting a zero by adding and subtracting the mean hindcast over all predictions issued (all times), $\bar{H} = \frac{1}{n}\sum_{j=1}^{n} H_j$, and the mean observation, $\bar{O} = \frac{1}{n}\sum_{j=1}^{n} O_j$, on the right-hand side (rhs) of Eq. (2) yields

$$\mathrm{MSE}_H = \frac{1}{n}\sum_{j=1}^{n}\left[(H_j - \bar{H}) - (O_j - \bar{O}) + (\bar{H} - \bar{O})\right]^2. \qquad (6)$$

Expanding Eq. (6) into single terms leads to

$$\mathrm{MSE}_H = \underbrace{\frac{1}{n}\sum_{j=1}^{n}(H_j - \bar{H})^2}_{=\,s_H^2} - \underbrace{\frac{2}{n}\sum_{j=1}^{n}(H_j - \bar{H})(O_j - \bar{O})}_{=\,2 s_{HO}} + \underbrace{\frac{2(\bar{H} - \bar{O})}{n}\sum_{j=1}^{n}(H_j - \bar{H})}_{=\,0} - \underbrace{\frac{2(\bar{H} - \bar{O})}{n}\sum_{j=1}^{n}(O_j - \bar{O})}_{=\,0} + \underbrace{\frac{1}{n}\sum_{j=1}^{n}(O_j - \bar{O})^2}_{=\,s_O^2} + (\bar{H} - \bar{O})^2, \qquad (7)$$

with the third and fourth term being zero because the sums of the anomalies are zero. The first term represents the sample variance of the hindcasts $s_H^2$, the second term contains the sample covariance of hindcasts and observations $s_{HO}$, the next-to-last term is the sample variance of the observations $s_O^2$, and the last term is the square of the bias, $\mathrm{Bias} = \bar{H} - \bar{O}$. A shorthand notation of the above equation is

$$\mathrm{MSE}_H = (\bar{H} - \bar{O})^2 + s_H^2 + s_O^2 - 2 s_{HO}. \qquad (8)$$

Using $s_{HO} = s_H s_O r_{HO}$, with $r_{HO}$ the sample correlation between the hindcasts and the observations, Eq. (8) yields

$$\mathrm{MSE}_H = (\bar{H} - \bar{O})^2 + s_H^2 + s_O^2 - 2 s_H s_O r_{HO}. \qquad (9)$$
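The decomposition can be checked numerically with synthetic data. The following sketch (illustrative, not MurCSS code) compares the right-hand side of Eq. (9) with the direct computation of Eq. (2); all sample moments use the 1/n normalisation (ddof=0), consistent with the definitions above:

import numpy as np

rng = np.random.default_rng(42)
o = rng.normal(size=200)                             # synthetic "observations"
h = 0.8 * o + 0.3 + rng.normal(scale=0.5, size=200)  # synthetic "hindcasts"

bias = h.mean() - o.mean()
s_h, s_o = h.std(ddof=0), o.std(ddof=0)
r_ho = np.mean((h - h.mean()) * (o - o.mean())) / (s_h * s_o)

lhs = np.mean((h - o) ** 2)                             # Eq. (2)
rhs = bias**2 + s_h**2 + s_o**2 - 2 * s_h * s_o * r_ho  # Eq. (9)
assert np.isclose(lhs, rhs)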

In the following, the MSE decomposition Eq. (9) is plugged into the skill score Eq. (1) to yield the decomposition of the MSESS.

2.3 Decomposing the mean square error skill score

The decomposition Eq. (9) is inserted into the MSESS with the climatological forecast $\bar{O}$ as reference (Eq. (3)). Using $\mathrm{MSE}_{\bar{O}} = s_O^2$, i.e. the MSE of the climatological forecast being an estimate of the climatological variance, we obtain

$$\mathrm{MSESS}(H,\bar{O},O) = -\left(\frac{\bar{H} - \bar{O}}{s_O}\right)^2 - \left(\frac{s_H}{s_O}\right)^2 + 2\,\frac{s_H}{s_O}\, r_{HO}. \qquad (10)$$

Adding and subtracting $r_{HO}^2$ and completing the square, the MSESS reads

$$\mathrm{MSESS}(H,\bar{O},O) = r_{HO}^2 - \left(r_{HO} - \frac{s_H}{s_O}\right)^2 - \left(\frac{\bar{H} - \bar{O}}{s_O}\right)^2. \qquad (11)$$

The first term of Eq. (11) is the square of the sample correlation coefficient, a dimensionless measure of the linear relationship between the hindcasts and the observations. The second term is the squared conditional bias, a measure of reliability. The last term represents the (unconditional) bias; it can be removed by analysing anomalies. The MSESS with the bias removed thus reads

$$\mathrm{MSESS}(H,\bar{O},O) = r_{HO}^2 - \left(r_{HO} - \frac{s_H}{s_O}\right)^2, \qquad (12)$$

hence being basically decomposed into correlation and conditional bias. For the hindcast H against the climatological forecast $\bar{O}$, we denote the left-hand side (lhs) of Eq. (12) as $\mathrm{MSESS}_H = \mathrm{MSESS}(H,\bar{O},O)$ and, analogously, for another forecast R against the climatological forecast $\bar{O}$, $\mathrm{MSESS}_R = \mathrm{MSESS}(R,\bar{O},O)$.
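In code, the decomposed skill score could look like the following sketch (function name and interface are illustrative, not those of the MurCSS source); with remove_bias=False it implements Eq. (11), otherwise Eq. (12):

import numpy as np

def msess_decomposed(h, o, remove_bias=True):
    """MSESS against the climatological forecast via Eq. (11)/(12)."""
    h, o = np.asarray(h, dtype=float), np.asarray(o, dtype=float)
    s_h, s_o = h.std(ddof=0), o.std(ddof=0)
    r_ho = np.mean((h - h.mean()) * (o - o.mean())) / (s_h * s_o)
    score = r_ho**2 - (r_ho - s_h / s_o) ** 2        # Eq. (12)
    if not remove_bias:
        score -= ((h.mean() - o.mean()) / s_o) ** 2  # bias term of Eq. (11)
    return score

For the synthetic data above, msess_decomposed(h, o, remove_bias=False) coincides with msess_climatology(h, o) up to floating-point precision.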

A similar decomposition can be obtained for a general reference forecast from Eq. (5) by expanding the fraction with $s_O^2$. After a few transformations, it reads

$$\mathrm{MSESS}(H,R,O) = \frac{r_{HO}^2 - \left(r_{HO} - \frac{s_H}{s_O}\right)^2 - r_{RO}^2 + \left(r_{RO} - \frac{s_R}{s_O}\right)^2}{1 - r_{RO}^2 + \left(r_{RO} - \frac{s_R}{s_O}\right)^2}. \qquad (13)$$

Thus, the MSESS of the hindcast against a reference forecast can be described with the MSESS of the hindcast against the climatological forecast and the MSESS of the reference prediction against the climatological forecast

$$\mathrm{MSESS}(H,R,O) = \frac{\mathrm{MSESS}_H - \mathrm{MSESS}_R}{1 - \mathrm{MSESS}_R}. \qquad (14)$$
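Continuing the synthetic example from Sec. 2.2 and reusing mse() and msess_climatology() from the earlier sketches, the equivalence of Eq. (14) with the direct definition Eq. (5) can be verified:

# Toy "uninitialised" reference runs, purely for illustration.
r = 0.6 * o + 0.2 + rng.normal(scale=0.8, size=o.size)

direct = 1.0 - mse(h, o) / mse(r, o)              # Eq. (5)
msess_h = msess_climatology(h, o)                 # MSESS_H
msess_r = msess_climatology(r, o)                 # MSESS_R
combined = (msess_h - msess_r) / (1.0 - msess_r)  # Eq. (14)
assert np.isclose(direct, combined)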

In the following, the interpretation of the various terms of this decomposition is discussed.

2.4 Interpretation of the MSESS and its decomposition

2.4.1 MSESS

The MSESS is an integral measure of the improvement in accuracy of the hindcasts H over the climatological forecast $\bar{O}$ or another reference forecast R with respect to the observations (reanalysis) O. The MSESS can take arbitrarily large negative values (as hindcast performance can be arbitrarily bad), but it is bounded from above by 1 for positive values. Positive values indicate a gain in accuracy of the hindcast compared to the reference; zero or negative values indicate no skill.

2.4.2 Correlation

The correlation coefficient is a dimensionless measure of the linear relationship between hindcast (or reference) and observations, ranging from -1 to 1. For the MSESS comparing the hindcast against the climatological forecast, the component MurCSS of the central evaluation system (CES) yields the correlation coefficient $r_{HO}$ between H and O. For an analysis of the hindcast against another reference prediction, the gain in correlation $r_{HO} - r_{RO}$ is shown.

2.4.3 Conditional Bias

The conditional bias is a measure of the reliability of the hindcast/reference prediction, i.e. it gives information about the distribution of the observations given the hindcast, $p(o|h)$ [see, e.g., Wilks, 2011]. Its meaning can be illustrated with a linear regression of the observations on the hindcast, see Fig. 1. The slope of this regression line can be expressed by the correlation coefficient and the standard deviations, $m = \frac{s_O}{s_H}\, r_{HO}$. The conditional bias thus gives an indication of the ratio of the variances of hindcast and observations or, in other words, of the distribution of the observations given the hindcasts. Figure 1 gives an example for a positive correlation coefficient.


[Figure: two panels, (a) m > 1, conditional bias > 0; (b) m < 1, conditional bias < 0]

Figure 1: Illustration of the conditional bias by the slope of a linear regression, for a positive (a) and a negative (b) conditional bias. Both cases have a positive correlation coefficient $r_{HO} > 0$.


For a positive correlation coefficient, a positive/negative conditional bias indicates a smaller/greater variance of the hindcast compared to the observations. For a negative correlation this relationship is reversed.
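This sign relationship can be made explicit: since $m - 1 = \frac{s_O}{s_H}\left(r_{HO} - \frac{s_H}{s_O}\right)$ and $s_O/s_H > 0$, the slope minus one always has the same sign as the conditional bias. A small sketch with illustrative names (not MurCSS code), reusing the synthetic h and o from Sec. 2.2:

import numpy as np

def slope_and_conditional_bias(h, o):
    """Slope m = (s_O/s_H) r_HO of the regression of o on h, and r_HO - s_H/s_O."""
    h, o = np.asarray(h, dtype=float), np.asarray(o, dtype=float)
    s_h, s_o = h.std(ddof=0), o.std(ddof=0)
    r_ho = np.mean((h - h.mean()) * (o - o.mean())) / (s_h * s_o)
    return (s_o / s_h) * r_ho, r_ho - s_h / s_o

m, cond_bias = slope_and_conditional_bias(h, o)
assert np.sign(m - 1.0) == np.sign(cond_bias)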

For the analysis of the hindcast H against a reference prediction R, the component MurCSS of the central evaluation system (CES) gives the gain in the conditional bias

$$\left(r_{HO} - \frac{s_H}{s_O}\right) - \left(r_{RO} - \frac{s_R}{s_O}\right). \qquad (15)$$

3 Evaluating ensemble spread

3.1 Continuous ranked probability score

In addition to the question whether the ensemble mean is an accurate prediction, it remains to clarify whether the ensemble spread is an appropriate representation of the forecast uncertainty. To this end, Goddard et al. [2013] suggested employing the continuous ranked probability skill score (CRPSS). The CRPSS is a skill score based on the continuous ranked probability score (CRPS), a quadratic measure in probability space [e.g., Wilks, 2011]:

$$\mathrm{CRPS} = \int_{-\infty}^{\infty}\left[F(y) - F_0(y)\right]^2 \,\mathrm{d}y, \qquad (16)$$

with F(y) being the cumulative distribution function (CDF) derived from the hindcast and, as the observations are not assigned a probability, $F_0(y)$ being the Heaviside (step) function. The Heaviside function is defined as

$$F_0(y) = \Theta(y - x) = \begin{cases} 0, & y < x, \\ 1, & y \geq x, \end{cases} \qquad (17)$$

and basically jumps from 0 to 1 at the observation x. CDFs of hypothetical hindcasts and the Heaviside function are schematically depicted in Fig. 2 (left). The other two panels of Fig. 2 illustrate how the CRPS works: the middle panel gives the difference between the forecast CDFs F(y) and the observation CDF $F_0(y)$, and the right panel shows the squared difference.


[Figure: three panels showing the CDFs, their differences, and the squared differences]

Figure 2: CDFs of two hypothetical hindcasts and the Heaviside function (a), difference between the CDFs and the Heaviside function (b), and the squared difference (c).


For this example, we deliberately chose two forecast distributions with the same mean but different variance to illustrate that the CRPS is sensitive to the variance of the forecast probability distribution. For an ensemble prediction system, the forecast distribution is derived from a discrete (and often small) ensemble. This can be done, e.g., by deriving an empirical cumulative distribution function (ECDF) based on the ranks of the individual members, by assuming an underlying Gaussian distribution and estimating its mean and variance, or by various other methods, such as ensemble dressing [Broecker and Smith, 2008]. Goddard et al. [2013] suggested adopting the Gaussian assumption and using the parametric form of the CRPS derived by Gneiting and Raftery [2007]. For a hindcast ensemble with mean $\hat{H}_j$ and variance $\sigma_{\hat{H}_j}^2$ at hindcast dates j, it reads

$$\mathrm{CRPS}(\hat{H}_j, \sigma_{\hat{H}_j}^2, O_j) = \sigma_{\hat{H}_j}\left[\frac{O_j - \hat{H}_j}{\sigma_{\hat{H}_j}}\left(2\,\Phi\!\left(\frac{O_j - \hat{H}_j}{\sigma_{\hat{H}_j}}\right) - 1\right) + 2\,\phi\!\left(\frac{O_j - \hat{H}_j}{\sigma_{\hat{H}_j}}\right) - \frac{1}{\sqrt{\pi}}\right], \qquad (18)$$

with $\phi(\cdot)$ and $\Phi(\cdot)$ denoting the Gaussian probability density and distribution function, respectively. This setting makes it possible to assess the ensemble spread for the hypothetical case of an unbiased forecast: the ensemble mean $\hat{H}_j$ is set to the climatological mean derived from the observations, $\hat{H}_j = \bar{O}$.
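Equation (18) translates directly into code. The following hedged sketch uses scipy.stats.norm and is vectorised over hindcast dates j; the function name and interface are illustrative, not those of the MurCSS source:

import numpy as np
from scipy.stats import norm

def crps_gaussian(mu, sigma, obs):
    """Closed-form CRPS of a Gaussian forecast N(mu, sigma^2), Eq. (18)."""
    z = (np.asarray(obs, dtype=float) - mu) / sigma
    return sigma * (z * (2.0 * norm.cdf(z) - 1.0)
                    + 2.0 * norm.pdf(z)
                    - 1.0 / np.sqrt(np.pi))

Averaging the returned values over all hindcast dates j gives the score entering the skill score of the next section.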

3.2 Continuous ranked probability skill score

As the goal is to assess whether the ensemble spread is a useful measure for the forecast uncertainty, we compare the ensemble forecast to a reference forecast with the desired spread. Goddard et al. [2013] suggested an unbiased forecast with spread equal to the MSE between the ensemble mean and the observations. The latter is a valid representation of the forecast uncertainty. Hence, the CRPSS for this particular case reads

$$\mathrm{CRPSS}_{\mathrm{ES}} = 1 - \frac{\mathrm{CRPS}_H(\bar{O}, \sigma^2, O)}{\mathrm{CRPS}_{\mathrm{MSE}}(\bar{O}, \mathrm{MSE}_H, O)}, \qquad (19)$$

and it basically compares the ensemble spread $\sigma^2$ to $\mathrm{MSE}_H$. Please note that for the desired case $\sigma^2 = \mathrm{MSE}_H$, the ratio on the rhs is 1 and $\mathrm{CRPSS}_{\mathrm{ES}} = 0$. Thus, if the ensemble spread compares well to $\mathrm{MSE}_H$ (our measure of forecast uncertainty), i.e. in the optimal case, the skill score used here attains 0, not 1, as one would typically expect. Figure 3 illustrates this behaviour based on a simulation study with simple stochastic processes emulating the observations and hindcasts.


[Figure]

Figure 3: $\mathrm{CRPSS}_{\mathrm{ES}}$ for various ratios of the ensemble spread and MSE from a simulation study. The solid lines are the mean over 100 repetitions, the shaded areas comprise ±1 standard deviation.


[ENSEMBLE SPREAD SCORE HAS BEEN RECENTLY IMPLEMENTED BUT IS NOT DESCRIBED HERE YET.]

4 Verification with MurCSS

4.1 Input

The plugin MurCSS for the CES calculates the MSESS and its decomposition. This section describes the various options of the plugin. It is useful to search the data for the analysis beforehand with solr_search to narrow down the possible option values. Table 1 lists and describes all options.


Output: Output directory. Default: /pf/b/user/evaluation_system/output/murcss
Output plots: Output directory of the produced plots. Default: /pf/b/user/evaluation_system/plotst/murcss
Decadals (mandatory): The experiments to use, e.g., 1960,1965,1970,...,1995, or equivalently 1960:5:1995. To exclude a year (for example 1970), write 1960:5:1995-1970.
Variable (mandatory): The name of the variable to analyze (CMOR name). The website provides a select list with a search function.
Project1 (mandatory): For example BASELINE1, BASELINE0 or CMIP5. It is also possible to specify a custom project if the username is included in the evaluation system.
Product1 (mandatory): Product of Project1, part of the data path.
Institute1 (mandatory): Institute of Project1, part of the data path.
Model1 (mandatory): Model of Project1, part of the data path.
Experiment1 (mandatory): Experiment of Project1, e.g., "decs4e" or "decadal", part of the data path.
Project2 (mandatory): Project which is compared to Project1. For example BASELINE1, BASELINE0 or CMIP5. It is also possible to specify a custom project if the username is included in the evaluation system.
Product2 (mandatory): Product of Project2, part of the data path.
Institute2 (mandatory): Institute of Project2, part of the data path.
Model2 (mandatory): Model of Project2, part of the data path.
Experiment2 (mandatory): Experiment of Project2, e.g., "decs4e" or "decadal", part of the data path.
Leadtimes (mandatory): Lead times to analyze as a comma-separated list; time ranges can be hyphenated. Default: 1,2-9
Observation (mandatory): Observations to compare with, e.g., merra or hadcrut. It is also possible to specify the path of an observation file.
Ensemblemembers (optional): A subset of the ensemble members to use, as a comma-separated list, for example r1i1p1,r2i1p1,r3i1p1.
Significance (optional): Whether to calculate significance levels (95%) using a bootstrap method. WARNING: this can take up to one day!
Bootstrap number (optional): Number of bootstrap runs, used if Significance is true. Default: 500
Level (optional): A height level for the analysis, for example 70 000 (Pa). At the moment only one level can be specified. Tip: find out common levels of observation and project beforehand.
Lonlatbox (optional): A region to analyze. To select the region 100W-40E and 20N-80N, write -100,40,20,80.
Maskmissingvalues (mandatory): Whether to mask grid points with missing values in the observation file. Default: TRUE
Cache (mandatory): Path of the working directory. Default: $USER_CACHE_DIR/$SYSTEM_TIMESTAMP
Result grid (optional): A grid file or a grid description like r20x25 for the result grid.

Table 1: Options for MurCSS

4.2 Output

Here, the output of the plugin is described, for comparing hindcasts against the climatological forecast (Sec. 4.2.1) and hindcasts versus another reference forecast (Sec. 4.2.2).

4.2.1 Hindcast vs. climatological forecast

Table 2 lists the output obtained for the analysis of the hindcasts versus the climatological forecast.


msess: mean squared error skill score according to Eq. (12)
rmss: root of the mean squared error skill score according to Eq. (12)
correlation: correlation coefficient $r_{HO}$, a measure of the linear relationship between the hindcast and the observations
conditional bias: measure of the reliability, $r_{HO} - \frac{s_H}{s_O}$
std_ratio: ratio of the standard deviations of observations and hindcast, $\frac{s_O}{s_H}$ (part of the conditional bias)
biasslope: slope of the linear regression between hindcast and observations, $m = \frac{s_O}{s_H}\, r_{HO}$; the conditional bias is 0 if $m = 1$
biasslope-1: slope of the linear regression between hindcast and observations minus 1, $m - 1 = \frac{s_O}{s_H}\, r_{HO} - 1$; the conditional bias is 0 if $m - 1 = 0$

Table 2: Output obtained for the comparison of hindcasts versus the climatological forecast.

4.2.2 Hindcast vs. reference prediction

Table 3 lists the output obtained for the analysis of the hindcasts versus another reference forecast, e.g., the uninitialised historical runs.


msess: mean squared error skill score calculated with Eq. (13)
rmss: root of the mean squared error skill score calculated with Eq. (13)
correlation: gain in correlation of the hindcast with respect to the reference prediction, $r_{HO} - r_{RO}$
conditional bias: gain in conditional bias of the hindcast with respect to the reference prediction, $\left|r_{HO} - \frac{s_H}{s_O}\right| - \left|r_{RO} - \frac{s_R}{s_O}\right|$; negative values indicate a reduced conditional bias of the hindcast
biasSkillScore: improvement of the conditional bias in terms of the slope of the linear regression, $1 - \frac{|m_{HO} - 1|}{|m_{RO} - 1|}$; positive values indicate a reduced conditional bias with respect to the reference prediction, negative values the opposite
biasslope-1: gain in conditional bias, represented by the slope of the linear regression minus 1, of the hindcast with respect to the reference prediction, $|m_{HO} - 1| - |m_{RO} - 1|$; negative values indicate a reduced conditional bias of the hindcast

Table 3: Output obtained for the comparison of hindcasts versus another reference forecast.

4.3 Example analysis

This section gives an example analysis with the plugin MurCSS. Suppose we want to estimate the skill of baseline1 against baseline0 for the near-surface air temperature tas, with the HadCrut data set as observations. First, solr_search identifies the respective data series. For baseline1 the command looks like

solr_search data_type=baseline1 variable=tas time_frequency=mon model=MPI-ESM-LR

The output is a list of the paths of all available data files, for example:

/miklip/integration/data4miklip/model/global/miklip/baseline1/output/
MPI-M/MPI-ESM-LR/decs4e1960/mon/atmos/tas/r1i1p1/
tas_Amon_MPI-ESM-LR_decs4e1960_r1i1p1_196101-197012.nc

With the knowledge of the directory structure, we have made the following adjustments for the analysis Baseline1 vs. Baseline0:

Decadals: 1960:5:1995
Variable: tas
Project1: baseline1
Product1: output
Institute1: mpi-m
Model1: mpi-esm-lr
Experiment1: decs4e
Project2: baseline0
Model2: mpi-esm-lr
Experiment2: decadal
Leadtimes: 2-5
Observation: HadCrut
Significance: TRUE

After the analysis is completed, the NetCDF files and the produced plots (Sec. 4.2) are available. Some of these plots are depicted in Fig. 4 for the analysis Baseline0/Baseline1 vs. Climatology.


[Figure: maps of MSESS, Correlation, Conditional Bias, Biasslope-1 and Std Ratio; left column: Baseline0 vs. Climatology, right column: Baseline1 vs. Climatology]

Figure 4: MSESS, Correlation, Conditional Bias, Biasslope-1 and the ratio of the standard deviations for the analysis Baseline0 against Climatology (left) and Baseline1 against Climatology (right) for the near-surface temperature tas. Statistically significant values are marked by a dot and grid points with missing values in the observations are masked.


In the analysis with the climatological forecast as reference, the MSESS describes how well the observations can be reproduced. Positive values indicate a good reproduction of the observations, with a perfect hindcast at 1; negative values indicate the opposite. The near-surface air temperature is better represented by Baseline1 than by Baseline0, especially in the Indian Ocean and Southeast Asia. Both models, however, cannot sufficiently reproduce the observations in the Pacific Ocean.

The MSESS can be decomposed into the correlation and the conditional bias (see Sec. 2.3). High positive values of the correlation indicate good agreement between model and observations. High negative values suggest a strong linear relationship as well, but the opposite sign indicates poor agreement between model and observations. The improved MSESS of Baseline1 with respect to Baseline0 is mainly produced by a higher correlation coefficient at most of the grid points.

The observations are best represented with a conditional bias of 0. Figure 4 shows that Baseline1 has the lowest conditional bias, but there are still some areas, for example the Pacific Ocean and the northern Atlantic, with a pronounced conditional bias.

Another approach to understanding the conditional bias and its relationship to the standard deviations of hindcast and observations is the slope of the linear regression. In Fig. 4 the biasslope-1 is depicted as well. It is noticeable that the patterns are quite similar to those of the conditional bias. For a positive correlation coefficient, a positive slope and a positive conditional bias suggest a higher standard deviation of the observations with respect to the standard deviation of the hindcast. If the correlation coefficient is negative, a positive slope and conditional bias indicate a higher standard deviation of the hindcast. For the near-surface air temperature, the correlation and the conditional bias are positive in Central Europe for Baseline1 and Baseline0, which suggests a higher standard deviation of the observations. The negative correlation and conditional bias over the Indian Ocean in Baseline0 indicate a higher standard deviation of the observations as well, but in Baseline1 the variance of the hindcast is higher, due to the positive correlation and the negative conditional bias.

The ratio of the standard deviations, $s_O/s_H$, is also shown in Fig. 4.

In Fig. 5 the output of MurCSS is depicted for the analysis Baseline1 vs. Baseline0. A positive value of the MSESS suggests improved accuracy of the hindcast compared to the reference, and a negative value indicates the opposite.


[Figure: maps of (a) MSESS, (b) Correlation, (c) Conditional Bias, (d) Biasslope-1, (e) Bias SkillScore]

Figure 5: MSESS, Correlation, Conditional Bias, Biasslope-1 and Bias SkillScore for the analysis Baseline1 against Baseline0 for the near-surface temperature tas. Statistically significant values are marked by a dot and grid points with missing values in the observations are masked.


Figure 5 a) shows that Baseline1 is more accurate in most regions. Only in a few small areas, for example in South America or the Labrador Sea, is the agreement of Baseline0 with the observations higher. Figure 5 b) shows the gain in the correlation coefficient of Baseline1 compared to Baseline0. Positive values indicate an improved correlation of Baseline1. The analysis of the near-surface air temperature shows an improved accuracy of Baseline1 with respect to Baseline0 for almost every grid point as well. In Fig. 5 c) the gain in the conditional bias is illustrated. Note that here negative values represent a reduction of the bias and thus improved accuracy. In the Labrador Sea, for example, the conditional bias of Baseline1 is higher, whereas the correlation is slightly increased. Overall, the contribution of the bias outweighs the effect of the correlation; thus the MSESS shows decreased accuracy in the Labrador Sea.

An additional approach to illustrating the conditional bias is the gain in biasslope-1, which is shown in Fig. 5 d). In this figure, negative values suggest a reduction of the conditional bias as well; thus the pattern is quite similar to Fig. 5 c). Due to the calculation of the gain with absolute values ($|m_{HO} - 1| - |m_{RO} - 1|$), a statement about the gain in the standard deviations is no longer possible.

A third method to illustrate the conditional bias is the bias skill score $1 - \frac{|m_{HO} - 1|}{|m_{RO} - 1|}$, Fig. 5 e). This time, positive values suggest a reduced conditional bias of Baseline1 compared to Baseline0 and negative values the opposite. Nevertheless, Figs. 5 c)-e), all illustrating the conditional bias, show similar patterns.

References

   J. Broecker and L. A. Smith. From ensemble forecasts to predictive distribution functions. Tellus A, 60(4):663 ff, 2008.

   T. Gneiting and A. E. Raftery. Strictly proper scoring rules, prediction, and estimation. J. Amer. Statist. Assoc., 102(477):359–378, 2007.

   L. Goddard, A. Kumar, A. Solomon, D. Smith, G. Boer, P. Gonzalez, V. Kharin, W. Merryfield, C. Deser, S.J. Mason, B.P. Kirtman, R. Msadek, R. Sutton, E. Hawkins, T. Fricker, G. Hegerl, C.A.T. Ferro, D.B. Stephenson, G.A. Meehl, T. Stockdale, R. Burgman, A.M. Greene, Y. Kushnir, M. Newman, J. Carton, I. Fukumori, and T. Delworth. A verification framework for interannual-to-decadal predictions experiments. Climate Dynamics, 40:245–272, 2013. ISSN 0930-7575. doi: 10.1007/s00382-012-1481-2. URL http://dx.doi.org/10.1007/s00382-012-1481-2.

   I. T. Jolliffe and D. B. Stephenson, editors. Forecast Verification. Wiley-Blackwell, second edition, 2012.

   B. Kirtman, S. B. Power, J. A. Adedoyin, G. J. Boer, R. Bojariu, I. Camilloni, F. J. Doblas-Reyes, A. M. Fiore, M. Kimoto, G. A. Meehl, M. Prather, A. Sarr, C. Schär, R. Sutton, G. J. van Oldenborgh, G. Vecchi, and H. J. Wang. Near-term climate change: Projections and predictability. In T. F. Stocker, D. Qin, G.-K. Plattner, M. Tignor, S. K. Allen, J. Boschung, A. Nauels, Y. Xia, V. Bex, and P. M. Midgley, editors, Climate Change 2013: The Physical Science Basis. Contribution of Working Group I to the Fifth Assessment Report of the Intergovernmental Panel on Climate Change. Cambridge University Press, 2013.

   A. H. Murphy. Skill scores based on the mean square error and their relationships to the correlation coefficient. Mon. Wea. Rev., 116(12):2417–2424, December 1988. ISSN 0027-0644. URL http://dx.doi.org/10.1175/1520-0493(1988)116<2417:SSBOTM>2.0.CO;2.

   D. S. Wilks. Statistical methods in the atmospheric sciences. Academic Press, San Diego, CA, 3rd edition, 2011.