On Model Selection in Cosmology

We review some of the common methods for model selection: the goodness of fit, the likelihood ratio test, Bayesian model selection using Bayes factors, and the classical as well as the Bayesian information theoretic approaches. We illustrate these different approaches by comparing models for the expansion history of the Universe. In the discussion we highlight the premises and objectives entering these different approaches to model selection and finally recommend the information theoretic approach.


Contents
1 Introduction In science we often have several competing theoretical models which try to explain the same natural phenomenon. Based on measured data we want to decide which model is the better one. As an example we consider two different cosmological models, a cold dark matter model (CDM) with a cosmological constant Λ called ΛCDM, and a cold dark matter model with a constant equation of state p = w for the dark energy component called wCDM . With both models we try to explain observations like the cosmic microwave background, galaxy cluster counts, supernovae distance measurements, to name only a view. Models with more parameters typically allow for a closer fit of the data, but are such models with more parameters indeed better (see Fig. 1). In such a context one often refers to Ockham's razor that one should not introduce additional parameters if they are not needed 2 . One task of model selection is to make this statement quantitative.
To set a well defined stage, consider some measured data points d i = (x i , y i ) with i ∈ {1, . . . , N }. A model is providing a function f (x, θ) such that f (x i , θ) is approximating y i for each i. The parameters θ = (θ 1 , . . . , θ K ) are from A ⊂ R K . For simplicity we assume that x, x i , y i ∈ R and also f (x, θ) is real valued 3 . The best fitting parameters θ ∈ A ⊂ R K of the model are then determined from a Bayesian approach, a maximum likelihood procedure, or a simple least-square-fit. If one considers only a single model and has a good idea about the priors for the parameters, then many physicists would agree that a Bayesian parameter estimation procedure is the appropriate thing to do (see [2]).
The situation is more complicated if one considers at least one other model g(x, φ) with parameters φ ∈ B ⊂ R L . For simplicity we name the models after the functions f and g. Typically the dimensions of the parameter spaces differ L = K and also the parameter spaces may not overlap. We can determine the optimal parameters θ and φ for each of the models f and g. But the question remains, which of the models is "better". This situation is illustrated in Fig. 1.
As a starting point we will briefly discuss some of the common methods used for parameter estimation. Then we will present methods used for the selection of models and also comment on the approximations and numerical methods used. In section 3 we use these methods to compare two models for the expansion history of the Universe. In the discussion we highlight the premises and objectives entering the different approaches and recommend the information theoretic procedure, preferably in its Bayesian flavour. In appendix A we summarise properties of statistical tests, the empirical distribution functions, and the Kullback-Leibler divergence. In appendix B we provide some details of the numerical implementation. We assume some familiarity with statistical methods used in physics and cosmology (see for example [4, chap. 40] for a short review). Most of the material shown here is not new. Some of the results are scattered throughout the literature so we try to present them in a coherent fashion and give due reference.  Figure 1: Three polynomials fitted to 10 data points: f (x, θ) = θ 0 + θ 1 x (solid blue line), g(x, φ) = φ 0 + φ 1 x + φ 2 x 2 (dashed red line) h(x, ψ) = ψ 0 + ψ 1 x + ψ 2 x 2 + ψ 3 x 3 + ψ 4 x 4 + ψ 5 x 5 (dotted green line). Which of these polynomials is the "best" model (see also Munroe [3]).

Parameter estimation
In the following we will give a short review of different methods for parameter estimation. The starting point for a comparison of a model with the data is in most cases the likelihood. To specify the likelihood function p f (d | θ), the model itself f (x, θ) and an error model for the data is needed. The vector of measurements is d = (x i , y i ) N i=1 and p f (d | θ) is the probability of obtaining the measured data points d, given the parameters θ in the model f . Often one assumes Gaussian errors, then the likelihood reads . . , f (x N , θ)) T and the covariance matrix Σ. With a maximum likelihood estimator we determine the parameters θ which are maximising p f (d | θ ). Hence in choosing the parameter θ , the data points d become the most probable data points given the model f . The least square method is a simplified version of the maximum likelihood estimator [5]. The likelihood is assumed to be Gaussian with a diagonal Σ and the σ i 's on the diagonal. Searching the maximum of p f (d | θ), or of gives the same result as searching for the minimum of This minimum determines the best parameters θ * of the least square fit.
In a Bayesian setting we also need the prior distribution p f (θ) of the parameters for model f . Using Bayes theorem we can determine the posterior distribution the distribution of the parameters θ, given the data d and the model f . The normalisation constant p f (d) is called the evidence or marginal likelihood. The evidence can be obtained by an integration in parameter space Given the data and the model, the evidence is a normalisation constant of the posterior distribution. Hence, we do not need to calculate the evidence if we want to determine the maximum θ or the mean of the posterior distribution p f (θ | d) only. The maximum 4 θ is called the maximum posterior (MAP) estimate. Clearly, there is more to parameter estimation than we covered here. We did not discuss how to choose priors, how to deal with nuisance parameters, or how to determine confidence or credibility regions. Nevertheless we provided the necessary prerequisites to be able to discuss model selection.

Model selection
Quite a few methods for model selection have been developed. Cosmological and astrophysical oriented reviews and books are for example [6], [7] [8]. A more philosophically inclined introduction with basic examples can be found in Sober [9, chapter 1].

The goodness of fit
The so called "goodness of fit" may serve as a starting point for this discussion, since it is often the first method students of physics learn in their lab-courses. One calculates the so called reduced χ 2 f,red where χ 2 f is calculated using eq. (3) with the best fit parameters θ * of the model. n df is the number of degrees of freedom, typically n df = N − K, with N the number of data points and K the number of parameters in the model f . If χ 2 f,red ≈ 1, this is considered a good fit, if χ 2 f,red > 1 a bad fit and if χ 2 f,red < 1 an overfit. Remember, one starts with a maximum likelihood estimate and makes assumptions about the data-and error-model which are often stark oversimplifications. As a remedy the number of degrees of freedom is determined from the "effective" number of independent data points. The problems of this approach are summarised in [10]. Some of the motivation for using the χ 2 f derives from the theory of statistical tests (see [11] or [8]). The p-value used in these tests is calculated as with G n df the cumulative probability distribution function of a χ 2 distributed random variable with n df degrees of freedom (see appendix A). Clearly p is one-to-one with χ 2 f . This p-value is the probability that the data may arise from the null hypothesis (our model including the error model). The smaller the p-value, the greater is the statistical incompatibility of the data with the null hypothesis. Hence, given the model (the null hypothesis), a small p-value allows us to reject the model using statistical test. From the p-value however we do not learn anything about the false acceptance rate (see appendix A). The p-value is a statement about data in relation to a specified hypothetical explanation (our model), not about the data itself, and especially not about other models. See the recent statement of the American Statistical Association on the (restricted) applicability of p-values [12]. Sometimes the goodness of fit or the related p-value is used for model selection directly but more often the likelihood ratio is utilised.

Likelihood ratio test
For the selection of models Neyman and Pearson [13] developed the likelihood ratio test. As usual we first consider so called nested models. The model f with parameter space A is a special case of the model g with parameter space B. More formally A B and f | A ≡ g| A restricted to A. Now we determine the best fitting parameter θ ∈ A, and φ ∈ B and calculate Our null hypothesis is "f is the true model with θ ∈ A". The alternative is "g is the true model with φ ∈ B but φ ∈ A". With these maximum likelihood estimates θ , and φ we calculate L. Fixing a significance level α one can proceed and specify the test. Often one relies on Wilk's theorem [14]: for nested models and for large sample sizes N the λ = −2 log L is approximately χ 2 -distributed with the number of degrees of freedom equal to ν = dim(B) − dim(A). The p-value is calculated as p = 1 − G ν (λ). We reject our null hypothesis if p < α with a predefined significance level α (see the appendix A). Contrary to the situation discussed with the goodness of fit, the alternative hypothesis is fully specified in the likelihood ratio test. The false negative rate 5 is the probability that the true alternative hypothesis (our model g) gets rejected. The Neyman-Pearson-Lemma [13] tells us that a test based on the likelihood ratio is minimising the false negative rate. In this sense the likelihood ratio test is optimal.
In the introduction we already considered a more general, non-nested setting. Vuong [15] discusses the likelihood ratio test for overlapping or non-nested models and he derives the relevant limiting distribution (not necessarily a χ 2 -distribution anymore). The application of the likelihood ratio test in these more general setting is reviewed by Lewis et al. [16].

Bayesian model selection
Bayesian methods, like the evidence and Bayes factors are nowadays frequently used to compare cosmological models (see for example [17], [18], [19], and [20]). The definition of the evidence in eq. (5) tells us that p f (d) is the conditional probability of obtaining the data vector d given the model f . For some simple models the evidence can be calculated and a suggestive interpretation emerges [17], but in most cases the evidence of one model by itself is not very informative. Its usefulness derives from the evidence ratio used in Bayesian model selection.
If we consider another model g we may compare its evidence with the evidence of the model f . For a consistent comparison of models within a Bayesian framework we need the joint probability p(f and d) = p f (d)π f of model f and data d. Similarly for p(g and d) = p g (d)π g of model g and the same data d. The π f and π g are the prior probabilities we assign to our models. Often these probabilities are chosen equal π f = π g , and the ratio of the full probabilities p(f and d) p(g and d) reduces to the evidence ratio, also called Bayes factor. A B f g larger than unity suggests, that we should favour model 6 f over model g. In applications almost exclusively π f = π g is assumed for the models, probably because one does not want to add a subjective touch to the analysis. The Bayes factor, as any result from a Bayesian analysis, explicitely depends on the prior distributions for the parameters of the models. You may have prior knowledge that allows you to specify a so called "subjective" prior [21]. Practitioners often use priors suggested by results from preceding observations or studies. Different approaches are used to motivate the so called "objective", "non-informative", or "reference" priors. Their definition can be based on the principle of insufficient reasoning, the maximum entropy principle, the invariance under transformations or scaling, or the missing information principle (see e.g. [22], [23], [24]). Kass & Wassermann [25] provide an overview and rules for selecting among these priors. For a stimulating dialogue with J.M. Bernardo on prior probabilities see [26] (don't miss the comments on this dialogue by D.R. Cox, A.P. Dawid, J.K. Ghosh and D. Lindley in the same issue). In any case, it is important to select the prior carefully, and it seems advisable to investigate the dependency of the model selection on the prior.
The calculation of the evidence (eq. (5)) can be quite challenging. Friel & Wyse [27] provide a review of different techniques. One of the first approximations for the evidence is due to Schwarz [28]. Asymptotically he arrives at the so called Bayesian Information Criterium 7 (see [29] for a detailed derivation) If we compare models, a smaller value of the BIC is better. The marginalised likelihood p f (d i | θ ) used in eq. (10) is obtained by fixing d i = (x i , y i ) and integrating over the remaining Beyond this asymptotic approach, several numerical techniques are currently used to calculate the evidence. In low dimensional parameter spaces a direct integration using standard numerical methods is sometimes possible. In cosmology a method derived from the ideas of Chib [30] has been used to estimate the evidence from a given MCMC chain [31,32]. With nested sampling one estimates the evidence directly [33]. Several implementations of this approach are currently in use (see e.g. [34] and [35] and references therein). Comparisons of further numerical methods are discussed by [36], [37], and [27].

Information theoretic approach to model selection
The information theoretic approach is based on the concept of minimising the distance between the distribution of the model and the distribution of the data. We assume that some observational data d is drawn at random from the true but unknown distribution with probability density p T (d). From our model f and the data d = {d i } N i=1 we construct a predictive distribution p p,f (d) for a single new observations d. Several possibilities for such a predictive distribution exist and we will discuss the classical and the Bayesian approach. For now we assume that we know such a predictive distribution p p,f (d) for our model f which we want to compare to the true distribution p T (d). We measure the discrepancy between the two distributions using the Kullback-Leibler (KL) divergence (see the appendix A) For model selection we rank the models f and g according to the value of D(p T | p p,f ) and D(p T | p p,g ) -the smaller the better.

Classical approach
In the classical information theoretic approach to model selection the predictive likelihood is used as the predictive distribution. This leads to the so called Akaike Information Criterion (AIC, see [38,39]). Applications of the AIC in cosmology are discussed in [40] and [41].
Although several definitions of a predictive likelihood exist (see [42] for a review) we follow [39] and use the marginalised likelihood eq. (11) as the predictive likelihood which is the likelihood of a new data point d assuming the maximum likelihood estimate θ of the parameters. This is already a well defined approach if d is a simple random variable. However in our regression setting we have d = (x, y) and we compare the predictions of the model f (x, θ) to the observed value y. For each of the observed data points d i = (x i , y i ) we know the uncertainties of the measurements entering the likelihood (compare eq. (1) and eq. (11)). But how do we calculate p f (d | θ ) for a d = d i ? At a first glance it seems necessary to introduce an additional model for the uncertainties. As an example we could interpolate between neighbouring values of the marginalised likelihood (eq. (11)) to determine p f (d | θ ).
Fortunately we will see below, that this is not necessary since we evaluate Let us start with the derivation of the AIC (following loosely [43]). The first term on the second line in eq. (12) does not depend on the model f . The second term is the expected log likelihood for the model f for all possible data This expectation is calculated using the true cumulative distribution F T . Unfortunately the true cumulative distribution F T is unknown. From the observational data d = ( it is always possible to construct the empirical cumulative distribution function F T,N (d) as a sum of step functions (see appendix A). Then an estimate of the expected log likelihood η(f ) is given by where we used dF T, (30)). Since we estimated the best parameter θ from the same dataset we use to construct the empirical distribution function and the bias corrected expected log likelihood (the second term in eq. (12)) is Clearly, this only shifts the problem from η(f ) to b(f ). In Appendix B.1 we explain the method of Konishi & Kitagawa [43] how to obtain a bias corrected estimate η(f ) : The model with the larger value of η is favoured.
Assuming that the true distribution p T is part of the family of distributions p f (z | θ) and that θ is a maximum likelihood estimate, it is shown in [38] that b(f ) asymptotically has the form K/N with K the dimension of the parameter space and N the number of data points. Rescaling this approximate expression of eq. (16) by −2N we arrive at the Akaike Information Criterium 8 The model with the smaller value of the AIC is favoured. A comparison of eq. (17) with eq. (10) shows that the AIC and the BIC differ in how they disfavour high dimensional parameter spaces. These terms are sometimes called Ockham's razor terms. Keep in mind that the derivations of the AIC and the BIC start from different principles: the BIC starts from the evidence and the AIC from the proximity of a model to the true distribution. Several extensions and "corrections" to the AIC have been proposed (see for example [44]). A corrected AIC, better suited for smaller sample sizes, was derived by [45] (see [46] for a unifying derivation of AIC and AICc).
Several of the assumptions, entering the derivation of the AIC, can be relaxed and the estimates of the bias b(f ) can be improved (see [43] for a summary). Indeed θ need not be a maximum likelihood estimate; the asymptotic of b(f ) is known for Fisher consistent estimates θ and also for MAP estimates θ , as obtained from a Bayesian parameter estimation procedure. However we are only aware of the bootstrap procedure discussed by [43] as a direct numerical approach (see appendix B.1).

Bayesian approach
In the classical approach we use the best fit marginalised likelihood p f (d | θ ) as the predictive distribution. In a Bayesian approach we use the posterior predictive distribution With the posterior distribution p f (θ | d) of the parameters given in eq. (4), and the marginalised likelihood p f (d | θ) from eq. (11) we can define the posterior predictive distribution With p ppd,f (d) in eq. (12) we compare the posterior predictive distribution to the true distribution p T (d) using the KL-divergence: The first term is not depending on the model f and the last term in eq. (20) is We follow the strategy from Sect. 2.4.1 and insert the empirical distribution function F T,N (d) for F T (d) and obtain where we expressed integral over p f (d i | θ) in parameter space as the expectation value E post [·] with respect to the posterior distribution p f (θ | d) of the parameters. The model with the larger value κ is favoured. As discussed in section 2.3 the value of κ is depending on the chosen prior.
The expectation E post [p f (d i | θ)] can be evaluated with Markov Chain Monte Carlo (MCMC) methods. We start with one chain simulating draws from p f (θ | d). Only this chain is needed to calculate an estimate of E post [p f (d i | θ)] for each of the data points d i . Contrary to section 2.4.1 we do not use a point estimate in the calculation of κ(f ). Therefore we think that a bias is not important in the calculation of κ(f ). Clearly this would deserve some further investigation.

Other methods
A few other methods for model selection are in use. Closely related to the Bayes factor is the relative belief ratio which measures the belief gained over the prior after an observation (see [47]). Seehars et al. [48] use the KL-divergence to quantify the information gained from new data sets and define the "surprise". The Deviance Information Criterium (DIC) was constructed by Spiegelhalter et al. [49] as a revised version of the AIC (see [50] and [51]). Although the DIC is popular, there is some criticism (see [52] for a summary). The derivation of the AIC [38], as sketched in section 2.4.1, assumes that the parametric models are regular 9 . For typical applications in cosmology, as given in section 3, this is the case. However models defined by multilayered neural networks are generically singular. For singular models Watanabe derived the Widely Applicable Information Criterium (WAIC, [53]). With cross-validation we split the data set. A training sample is used to determine the optimal parameters of the model and the remaining part (the validation sample) is used for estimating the discrepancy between the optimised model and the data. Then one selects the model with the smallest discrepancy (see [54] for a survey). Gelman et al. [50] compare the AIC, DIC, WAIC and cross-validation in a variety of situations. In the introduction we mention Ockham's razor and the principle of parsimony. This can be formalised by assigning the algorithmic complexity as a unique measure to the model describing the data [55]. Typically one is not able to calculate the algorithmic complexity, but one can estimate the so called minimum description length (MDL, [56]). In its asymptotic form the MDL is similar to the AIC and BIC with yet another Ockham's razor term. The approach from complexity theory and from information theory seem to be closely related, but this is an open issue (see also [57]).

An application in cosmology
We illustrate these approaches to model selection by a classical example from cosmology: the accelerating expansion of the Universe as determined from redshift and luminosity measurements of supernovae. A supernova type Ia (SN Ia) is a stellar explosion with a well defined luminosity [58]. In astronomy the absolute luminosity is typically specified in logarithmic units, the absolute magnitude M B in a given frequency range, here the B-band. The observed flux is measured by the apparent magnitude m B (again in logarithmic units). The distance modulus is defined as µ := m B − M B . The redshift z of the supernova or of the hosting galaxy is measured spectroscopically. In a homogeneous and isotropic universe model the distance modulus -redshift relation can be calculated. Depending on the matter content of the Universe we get µ(z, θ) = 5 log 10 d L (z, θ) + 25, with luminosity distance d L in Mpc and the redshift z of the supernova. The model dependence enters through the luminosity distance d L (z, θ) with the parameters θ. We do not pursue an exhaustive investigation of the currently fashionable cosmological models and therefore fix some of the otherwise free parameters. Consider Nicola et al. [59] and Raveri & Hu [60] for a comparison of more models using comprehensive data sets. We assume a spatially flat Universe (Ω k = 0) and choose H 0 = 70kms −1 Mpc −1 compatible with the data from the Union 2.1 sample [61], but slightly larger than the Planck value [62]. Also if we consider only supernovae, the value of the Hubble parameter is completely degenerate with the absolute magnitude. The overall scale is given by the Hubble distance d H = c Ho = 4.28 Gpc, with c the speed of light. We consider two cosmological models: 1) The ΛCDM model with one parameter, the dimensionless density parameter Ω m . Since we assume a flat background we have 0 ≤ Ω m ≤ 1 and Ω Λ = 1 − Ω m for the cosmological constant term. The luminosity distance is given by (see e.g. [63]).
2) The flat wCDM model, with a constant equation of state p = w for the dark energy component, has two free parameters (Ω m , w). The density parameter obeys 0 ≤ Ω m ≤ 1 and the cosmological term Ω Λ = 1 − Ω m at present. The time dependence of the cosmological term is parametrised using w and the luminosity distance is then The observational data d i = (z i , µ i ) are the redshift z i and the distance moduli µ i of SN Ia. In the Union 2.1 sample we have N = 580 such observations from SN Ia together with an estimate σ µ,i of the uncertainty of each distance modulus [61,64]. Assuming a cosmological model we calculate the distance modulus µ i given the redshift z i depending on the cosmological parameters of the model. The likelihood is the starting point for all the approaches to model selection we discussed. Similar to eq. (1) we assume a Gaussian likelihood. For the flat ΛCDM model we have with d = (z i , µ i ) N i=1 and the distance modulus µ(z i , Ω m ) calculated from the model. For the marginalised likelihood evaluated at d i we have from eq. (11) The likelihood p w (d | Ω m , w) and marginalised likelihood p w (d i | Ω m , w) of the wCDM model are defined analogously. In the Bayesian analysis we need to specify the priors. We assume a uniform distribution on  2078. This suggests that we should prefer the wCDM model. Bayesian information theoretic approach: We estimate κ and obtain κ Λ = 0.20469 for the ΛCDM model and κ w = 0.20459 for the wCDM model. Therefore the ΛCDM model is preferred over the wCDM model. Neither using the least-square results nor with the likelihood ratio test we arrive at a definite conclusion. Both models fit the data, and we also cannot rule out ΛCDM or the wCDM model. In the Bayesian approach often Jefferey's [22] scale is employed to express the numerical value of the evidence ratio B Λw in words 10 . Hence with B Λw = 5.4 we have "substantial evidence" 10 Jeffreys' scale (see appendix B in [22]) for the evidence ratio B translated to our conventions reads: B < 1: negative evidence; 1 ≤ B < √ 10: barely worth mentioning; √ 10 ≤ B < 10: substantial; 10 ≤ B < 10 3/2 : to support the ΛCDM over the wCDM. However such a "universal" scale is disputed (see e.g. [65] or [66]). Similarly, the mere comparison of numbers, like we did with the AIC, the BIC, η and κ, is not satisfying. A scale is missing. We do not want to propose a universal scale, which probably does not exist, but we suggest a model dependent approach to investigate the stability of our model selection.
As a concrete example, consider the Bayesian information theoretic approach and the values of κ Λ = 0.20469 and κ w = 0.20459 for the ΛCDM and the wCDM model, respectively. To see whether this difference is important we repeatedly generate artificial data sets and calculate the κ Λ for each of these data sets. This allows us to estimate the dispersion ∆ κ Λ . Clearly the fluctuations are depending on how we generate our artificial data set. We start with the Union 2.1 sample [61] and keep the redshift z i and the uncertainty σ µ,i fixed for each supernova. We generate randomised distance moduli µ i fluctuating around the prediction of the ΛCDM model with where s i is a random number, normally distributed with zero mean and standard deviation σ µ,i . This gives us an artificial data set (z i , µ i , σ µ,i ) 580 i=1 . For one hundred of these artificial data sets we calculate κ and estimate the dispersion of κ using the mid-spread 11 ∆ κ Λ ≈ 0.037. This dispersion estimate ∆ κ Λ is two orders of magnitude larger than the difference between κ Λ and κ w . Hence, using the κ from the Bayesian information theoretic approach we can not select one of the models. This is not a full evaluation of the fluctuations present in the model, but it helps us to assess the relevance of our results in a model dependent way. We also calculated such dispersion estimates for the p-value, the AIC, η, the B Λw , and the BIC. We obtain ∆ p Λ ≈ 0.46, ∆ AIC Λ ≈ 42, ∆ η Λ ≈ 0.034, ∆ B Λw ≈ 3.4, ∆ BIC Λ ≈ 42. For these quantities the dispersion estimates are always significantly larger than the observed differences between the ΛCDM and wCDM models. Hence we cannot decide which model is better. This also illustrates that all the key figures used in model selection, like the p-value, the AIC, the κ, etc. are random variables, depending on the model and the available data set. Fixing levels or using universal scales can hence be misleading (see also [60,67]).
As mentioned in section 2.3 we study the dependence on the priors. We calculate the Bayes factor B Λw and κ for a series of priors. Still we restrict the parameters to the ranges Ω m ∈ [0, 1] and w ∈ [−2, 0], but in addition to the flat distribution we use Jeffreys' prior (a suitably rescaled Beta(1/2,1/2) distribution) and a series of truncated and renormalised Gaussian distributions. For the truncated Gaussian distributions we vary the width from almost flat on the intervals to strongly peaked and we use two different mean values, one is centred on the "correct" value (the MAP estimate). For all these priors the posterior distributions are very similar and the MAP estimates agree within the fluctuations. The κ Λ from the Bayesian information theoretic approach ranges from 0.204694 to 0.204697 for these priors. Only in the extreme cases, where we have negligible overlap between the prior and the posterior distribution, we get a value outside this range. Hence, if the prior is sufficiently broad and shows some overlap with the posterior distribution we get consistent results irrespective of the prior. A similar behaviour is observed for the Bayes factor B Λw . strong; 10 3/2 ≤ B < 100: very strong, 100 ≤ B: decisive evidence. 11 The mid-spread, or inter quartile range, is defined as the difference between 75th and 25th percentiles. It is a robust estimator of dispersion. In a Gaussian distribution the mid-spread is approximately 1.34 times the standard deviation.
So we conclude: Using the Union 2.1 data set we cannot decide wether the ΛCDM or the wCDM model should be preferred.

Summary and Discussion
In physics the construction of models is guided by basic principles (conservation laws, symmetries, etc.). Adding another term, as illustrated in Fig. 1 in the introduction, is often not acceptable because one would violate these principles. For statistical applications in engineering or the social sciences this is often not a major concern. Model selection is used as a criterium to decide whether one should introduce new parameters and new dependencies. Ockham's razor suggests that one should go for the simpler model. However simplicity needs to be quantified (see Sober [68]). The dimension of the parameter space immediately comes to mind as such a measure of simplicity. But the dimension is only a rough and sometimes misleading measure of parsimony (see e.g. [9], and also compare Fig. 1). The goodness of fit, the likelihood ratio, the evidence ratio, or the KL-divergence from the information theoretic approaches are operationally well defined procedures for model selection. They allow quantitative arguments beyond mere qualitative arguments. These methods, given in Sect. 2, are now briefly summarised before we critically discuss them: -With the goodness of fit one ranks models according to their ability to fit the data points.
-With the likelihood ratio you compare the probabilities of your data given the best fitting models. Together with a predefined significance level the likelihood ratio allows you to discard a given model (your null hypothesis) in favour of the alternative model. -In a Bayesian model comparison you use the evidence ratio to compare the joint probabilities of the models and the data. This depends on the likelihood and the prior. -In the classical information theoretic approach you measure how good the best fitting models are at predicting new data. -In the Bayesian information theoretic approach you measure how good the posterior predictive distributions of the models are at predicting new data. The "goodness of fit" based on the χ 2 f,red is sometimes used for model selection. The major shortcoming is that the χ 2 f,red does not factor in any contributions from a false acceptance (compare appendix A). If we specify a second model and assume a Gaussian error model as well as independent sampling, the difference χ 2 f − χ 2 g is related to the likelihood ratio as used in the likelihood ratio test.
Although the likelihood ratio test and the Bayesian model selection derive from quite different approaches towards statistical analysis, they both try to find the true model. Either you discard the false models via tests, or you determine the most probable model. The information theoretic approach is different. There one accepts that a model is an approximation and one tries to identify the model which is closest to the true empirical distribution. This approach allows us to predict new data in the best possible way.
Similarly Wit et al. [69] discuss the following two questions: i) which modelling procedure will, with sufficient data, identify the true model? or ii) based on the data, which model lies closest to the true model? They conclude indecisive: asking different questions leads to different approaches for model selection. However one is able to go beyond this neutral statement. Consider the following aphorism attributed to G. Box [70]: "all models are wrong". In physics one would not use the term "wrong". Physical models have their range of applica-bility. We know that Newtonian gravity is failing on large scales and we assume that general relativity is failing on very small scales. However both have their range of applicability and we successfully compare their predictions with measurements and observations. Presumably all models in physics, at least the models which may be confronted with data, are effective models (see for example the discussion of effective field theories in [71]). Hence, methods of model selection, which try to identify the true model are deceptive. We know from the outset that our models are "wrong". This is a bit nitpicking, since we know about the range of applicability of our models. Nevertheless it is advisable to respect this situation from the beginning and use the information theoretic approach. There we try to find the best approximating, not necessarily the true model. This becomes especially important in cosmology, where new observations always add to the existing data. For example new observations of galaxies are added to the already known galaxy catalogues. The Universe contains the galaxy distribution and probabilistic physical models are used to describe it (see [72]). Again, we seek the best approximating model. Now consider another argument from the philosophy of science (see also [73]). Bayesian updating is sometimes presented as the only relevant way of plausible reasoning in science (Jaynes [74]). This would favour methods based on the Bayesian evidence and the evidence ratio for model selection. However scientist devise new models and compare them to data. Either the data supports the model or sometimes allows a rejection (falsification). This cycle has been put forward by Popper [75] and refined by Lakatosz [76]. Actually this approach seems to be too restrictive to describe the scientific growth of knowledge as outlined by Feyerabend [77] and Kuhn [78]. Laudan [79] argues that the contextual problem solving effectiveness is the key ingredient for a successful description of scientific progress (compare also with [80]). In other words, we seek the model which offers the most effective way to describe new data. This is the idea behind the information theoretic approach.
Up to now we argued for the information theoretic approach in general but did not differentiate between the classical and the Bayesian version. As we already stated in the introduction we prefer the Bayesian approach if the prior is well specified. We do not want to repeat the arguments exchanged in the discussion of the Bayesian versus the classical approach to statistics. Perhaps the articles by Cousins [2] and Efron [81], including the comments directly following Efron's article, may serve as an introduction to this discussion. A pragmatic reconciliation is suggested by Kass [82] in his "big picture".
As an example we discuss an application in cosmology. From the observations of distant supernovae one can infer the accelerating expansion of the Universe (see [83]). The question we address is wether this data allows for a more detailed look at the expansion history of the universe and specifically if we can decide between the ΛCDM and the wCDM model. Employing the Bayesian information theoretic approach we calculate the κ for both models. With a simple data driven model we calculate the dispersion of this κ and we also investigate the dependence of κ on the prior distribution. Based on the given data we cannot decide wether the ΛCDM or the wCDM model is better. The other methods give no decisive answer, too.
So far we presented methods for the comparison of models based on their ability to fit or predict observational data. There are further criteria we can and should employ to assess physical models. Independent from the observational data, new models (hopefully) make predictions and solve conceptual problems. They can be judged by their effectiveness to solve such problems [79]. This is not a quantitative endeavour, one has to present arguments for and against models, often based on the foundations of the models, or criticising the viability A SOME RESULTS FROM MATHEMATICAL STATISTICS of the approximations used. But in the end physical models have to stand the comparison with data [84]. Then the methods of model selection we discuss come into play. Consider N independent random realisations (x 1 , . . . , x N ) of this random variable. Then the empirical distribution function is defined as Here I A (x) is the indicator function of the set A with I A (x) = 1 if x ∈ A and zero for x ∈ A. The theorem of Glivenko-Cantelli states that F N (x) converges for N → ∞ towards F (x) uniformly, this means F N − F ∞ → 0 almost surely (see for example [88], and Figure 4). For example this theorem guarantees the convergence of the empirical median and quantiles. The empirical distribution function is analogously defined in higher dimensions. The halfinterval is replaced by half open rectangle stretching to infinity with the point x i marking the lower left corner.
Kullback-Leibler divergence: The Kullback-Leibler (KL)-divergence ( [89], also called relative entropy) measures the deviation between the distribution of two random variables with probability densities p(z) and q(z) (or the corresponding cumulative distribution functions). The KLdivergence is not symmetric in its arguments (it is not a distance). For discrete probability distributions the interpretation is straightforward. The information content in the discrete then the KL-divergence measures of the information lost, if the probability distribution q is used to approximate the true probability distribution p. This characterisation carries over to the continuum. Up to a multiplicative constant the KL-divergence is a unique measure of divergence (see [91] for details).

B Details of the implementation
We have chosen the statistical package R [92] as the basic tool for our computations 12 . The results are presented in section 3. Here you find some details about the implementation and the packages we use.
-We calculate the minimum θ * of the χ 2 according to eq. (3) using the function nls from the core of R [92]. The p-values are calculated using the built in χ 2 -distribution function. -For the maximum likelihood estimate θ we use the function mle2 from the package bbmle [94]. With this θ we calculate the maximum value of the likelihood p f (d | θ ) which we use in the computation of the likelihood ratio. And again we use the built in χ 2 -distribution function to calculate the p-value for the likelihood ratio test (see section 2.2). -Since we only consider a one-and a two-dimensional parameter space, we are able to calculate the evidence by direct numerical integration. We use the builtin function integrate and an adaptive multidimensional integration routine hcubature from the package cubature [95]. The direct numerical integration gives similar results compared to the quite noisy and costly results obtained from nested sampling using the package RNested [96]. The BIC (eq. (10)) is calculated using a function provided in the package bbmle [94]. -For the classical information theoretic approach we first calculate the AIC (see eq. (17)) using functions provided in the bbmle package [94]. To go beyond this asymptotic result we calculate η(f ) according to eq. (14). The bias corrected average log-likelihood is η(f ) = η(f )− b(f ), where b(f ) is the bootstrap estimate of the bias b(f ) (see eq. (15)). In section B.1 we give a detailed description of this bootstrap procedure due to Konishi and Kitagawa [43]. We use 100k bootstrap samples to estimate b(f ). -We prepare Markov chains with the function metrop from the mcmc package [97]. For convergence diagnostics and for tuning of the sampler parameters we employ the coda package [98]. Using Gelman and Rubin's convergence diagnostic [99] we see that all our chains converge after at least 1000 steps, even if we start in the extreme points of the parameter range. For the ΛCDM model we build a chain with a length of 15 Mio steps. We average the results over 15 steps and use this batched chain for the MAP estimate and to calculate κ (see next point). For the wCDM model we build a chain with a length of 30 Mio steps and average the results over 30 steps. -We calculate the estimateκ as described in section 2.4.2. In the ΛCDM model we estimate for each data point d i = (z i , µ i ) from one Markov chain.
Here Ω m,l is one state from the Markov chain and L is the length of the chain. Inserting this estimate of E post [p Λ (d i | Ω m )] into eq. (22) we obtain κ Λ as an average over all the data points. We proceed similarly calculating κ w for the wCDM model. You can download an abbreviated version of our code from https://homepages.physik. uni-muenchen.de/~Martin.Kerscher/software/modelselect/ .

B.1 Bootstrap for b(f )
Before we describe the bootstrap procedure of Konishi and Kitagawa [43] we give a more detailed definition of the average bias. We augment the notation from Sect.
The dependence of the estimated parameters θ (d) and the empirical distribution function F T,N,d on the data set d is now explicit. Konishi and Kitagawa [43] propose a bootstrap procedure to estimate b(f ). First generate bootstrap samples d = (x i ,ỹ i ) N i=1 from the data by repeatedly drawing from d with putting back (i.e. sampling from F T,N,d ).