Paper 19 Objective Bayesian Methods for Model Selection: Introduction and Comparison

19.1 Bayes Factors

19.1.1 Basic Framework

Comparing \(q\) models for the data \(\mathbf{x}\), \[ M_{i} : \mathbf{X} \text { has density } f_{i}\left(\mathbf{x} \mid \boldsymbol{\theta}_{i}\right), \quad i=1, \ldots, q \] The marginal or predictive densities of \(\mathbf X\) are \[ m_{i}(\mathbf{x})=\int f_{i}\left(\mathbf{x} \mid \boldsymbol\theta_{i}\right) \pi_{i}\left(\boldsymbol\theta_{i}\right) d \boldsymbol\theta_{i} \] The Bayes factor of \(M_j\) to \(M_i\) is given by \[ B_{j i}=\frac{m_{j}(\mathbf{x})}{m_{i}(\mathbf{x})}=\frac{\int f_{j}\left(\mathbf{x} \mid \boldsymbol{\theta}_{j}\right) \pi_{j}\left(\boldsymbol{\theta}_{j}\right) d \boldsymbol{\theta}_{j}}{\int f_{i}\left(\mathbf{x} \mid \boldsymbol{\theta}_{i}\right) \pi_{i}\left(\boldsymbol{\theta}_{i}\right) d \boldsymbol{\theta}_{i}} \] The Bayes factor is the “odds provided by the data for \(M_j\) versus \(M_i\).” It is also called the “weighted likelihood ratio of \(M_j\) to \(M_i\),” with the priors being the “weighting functions.”

If we introduce prior model probabilities \(P\left(M_{j}\right)\), then the posterior probability of model \(M_i\), given the data \(\mathbf x\), is \[ P\left(M_{i} \mid \mathbf{x}\right)=\frac{P\left(M_{i}\right) m_{i}(\mathbf{x})}{\sum_{j=1}^{q} P\left(M_{j}\right) m_{j}(\mathbf{x})}=\left[\sum_{j=1}^{q} \frac{P\left(M_{j}\right)}{P\left(M_{i}\right)} B_{j i}\right]^{-1} \] that is, \[ P\left(M_{i} \mid \mathbf{x}\right)=\frac{P\left(M_{i}\right) \int f_{i}\left(\mathbf{x} \mid \boldsymbol{\theta}_{i}\right) \pi_{i}\left(\boldsymbol{\theta}_{i}\right) d \boldsymbol{\theta}_{i}}{\sum_{j=1}^{q} P\left(M_{j}\right) \int f_{j}\left(\mathbf{x} \mid \boldsymbol{\theta}_{j}\right) \pi_{j}\left(\boldsymbol{\theta}_{j}\right) d \boldsymbol{\theta}_{j}} \] With equal prior model probabilities this is equivalent to the renormalized marginal probabilities \[ \overline{m}_{i}(\mathbf{x})=\frac{m_{i}(\mathbf{x})}{\sum_{j=1}^{q} m_{j}(\mathbf{x})} \]
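To make the renormalization concrete, here is a minimal Python sketch (not from the paper) that converts log marginal likelihoods and prior model probabilities into posterior model probabilities; the numerical values are hypothetical placeholders.

```python
import numpy as np

def posterior_model_probs(log_marginals, prior_probs=None):
    """Posterior model probabilities P(M_i | x) from log m_i(x).

    log_marginals : array of log m_i(x), one entry per model
    prior_probs   : P(M_i); uniform if omitted
    """
    log_m = np.asarray(log_marginals, dtype=float)
    q = len(log_m)
    p = np.full(q, 1.0 / q) if prior_probs is None else np.asarray(prior_probs, float)
    # work on the log scale to avoid underflow, then renormalize
    log_post = np.log(p) + log_m
    log_post -= log_post.max()
    post = np.exp(log_post)
    return post / post.sum()

# hypothetical log-marginals for three candidate models
print(posterior_model_probs([-102.3, -100.1, -104.7]))
```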

19.1.2 Motivation for the Bayesian Approach to Model Selection

  • Reason 1: Bayes factors and posterior model probabilities are easy to understand. Bayes factors have a direct interpretation as odds, and posterior model probabilities are direct statements of probability, unlike the indirect interpretation of a p-value.

  • Reason 2: Bayesian model selection is consistent. Rather surprisingly, use of most classical model selection tools, such as p-values, \(C_p\) statistics, and AIC, does not guarantee consistency. If the true model is not among the entertained models, Bayesian model selection will choose the candidate model that is closest to the true model in terms of Kullback–Leibler divergence.

  • Reason 3: Bayesian model selection procedures are automatic Ockham’s razors. Bayesian procedures naturally penalize model complexity, with no need to introduce an explicit penalty term.

  • Reason 4. The Bayesian approach to model selection is conceptually the same, regardless of the number of models under consideration. Classical frequentist methods treat the comparison of two models differently from the comparison of several models: choosing between two models is usually framed as hypothesis testing, whereas comparing many models requires different machinery. The Bayesian approach is uniform across these settings.

  • Reason 5. The Bayesian approach does not require nested models, standard distributions, or regular asymptotics.

  • Reason 6. The Bayesian approach can account for model uncertainty. The classical approach of selecting a model and then using it for estimation and prediction tends to yield overoptimistic measures of accuracy, which is why it is often recommended to use part of the data to select the model and the rest for estimation and prediction. The Bayesian approach can instead keep all the models in play rather than choosing one; this is known as ‘Bayesian model averaging’ and is the standard way to handle model uncertainty.

  • Reason 7. The Bayesian approach can yield optimal conditional frequentist procedures.

19.1.3 Utility Functions and Prediction

Model selection can also be approached from the perspective of decision analysis; from that viewpoint, Bayesian model averaging is an important methodology for prediction.

If a single specific model must be selected, a common method is to choose the model with the largest posterior probability.

19.1.4 Motivation for Objective Bayesian Model Selection

Subjective & Objective Bayesian Analysis.

Subjective Bayesian analysis is attractive, but it requires prior elicitation from subject-matter experts.

One often initially entertains a wide variety of models, and careful subjective specification of prior distributions for all the parameters of all the models is then essentially impossible.

19.1.5 Difficulties in Objective Bayesian Model Selection

  • Difficulty 1. Computation can be difficult. Bayes factors require integrals that can be hard to evaluate in high dimensions, and the total number of models under consideration can be enormous.

  • Difficulty 2. When the models have parameter spaces of differing dimensions, use of improper noninformative priors yields indeterminate answers. Suppose improper noninformative priors \(\pi_i^N\) and \(\pi_j^N\) are entertained for models \(M_i\) and \(M_j\). The formal Bayes factor in this case is \(B_{ji}\). If one instead uses the equally valid improper priors \(c_{i} \pi_{i}^{N}\) and \(c_{j} \pi_{j}^{N}\), the Bayes factor becomes \(\left(c_{j} / c_{i}\right) B_{j i}\). Since the choice of \(c_{j} / c_{i}\) is arbitrary, the Bayes factor is clearly indeterminate.

  • Difficulty 3. Use of ‘vague proper priors’ usually gives bad answers in Bayesian model selection.

The following example illustrates this, worked out in detail.

Example: Suppose we observe \(\mathbf{X}=\left(X_{1}, \dots, X_{n}\right)\), where the \(X_i\) are iid \(\mathcal{N}(\theta,1)\), and we wish to compare \(M_1: \theta=0\) with \(M_2: \theta\neq 0\). Under \(M_2\), give \(\theta\) a \(\mathcal{N}(0,K)\) prior with \(K\) large; this is the usual vague proper prior for a normal mean. An easy calculation using the definition of the Bayes factor yields \[ B_{21}=(n K+1)^{-1 / 2} \exp \left(\frac{K n^{2}}{2(1+K n)} \bar{x}^{2}\right) \] The calculation goes as follows. By the definition of the Bayes factor, \[ B_{j i}=\frac{m_{j}(\mathbf{x})}{m_{i}(\mathbf{x})}=\frac{\int f_{j}\left(\mathbf{x} \mid \boldsymbol{\theta}_{j}\right) \pi_{j}\left(\boldsymbol{\theta}_{j}\right) d \boldsymbol{\theta}_{j}}{\int f_{i}\left(\mathbf{x} \mid \boldsymbol{\theta}_{i}\right) \pi_{i}\left(\boldsymbol{\theta}_{i}\right) d \boldsymbol{\theta}_{i}} \] Here the denominator corresponds to model \(M_1\), which has no free parameters, so it is simply \[ \begin{align} f_1(\mathbf x)&=\prod _{i=1}^n\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{x_i^2}{2}\right)\\ &=\frac{1}{(2\pi)^{\frac{n}{2}}}\exp\left(-\frac{1}{2}\sum_{i=1}^n x_i^2\right) \end{align} \]

and the numerator is

\[ \begin{align} &\int \prod_{i=1}^n \left [\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{(x_i-\theta)^2}{2}\right)\right ]\frac{1}{\sqrt{2\pi K}}\exp\left(-\frac{\theta^2}{2K}\right)d\theta\\ =& \int \frac{1}{(2\pi)^{\frac{n+1}{2}}\sqrt K} \exp\left(- \frac{K\sum _{i=1}^nx_i^2-2\theta n\overline xK+nK\theta^2+\theta^2}{2K}\right)d\theta \end{align} \] Using the Gaussian integral identity \[ \int \exp(-ay^2+xy)\,dy=\sqrt{\frac{\pi}{a}}\exp\left(\frac{x^2}{4a}\right) \] the numerator becomes \[ \begin{align} & \int \frac{1}{(2\pi)^{\frac{n+1}{2}}\sqrt K} \exp\left(- \frac{K\sum _{i=1}^nx_i^2-2\theta n\overline xK+nK\theta^2+\theta^2}{2K}\right)d\theta\\ =& \frac{1}{(2\pi)^{\frac{n+1}{2}}\sqrt K} \exp\left(-\frac{1}{2} \sum _{i=1}^nx_i^2\right) \int \exp\left(-\frac{-2\theta n\overline x K+(nK+1)\theta^2}{2K}\right)d\theta\\ =& \frac{1}{(2\pi)^{\frac{n+1}{2}} \sqrt K} \exp\left(-\frac{1}{2} \sum _{i=1}^nx_i^2\right) \sqrt{\frac{2\pi K}{nK+1}}\exp\left(\frac{Kn^2}{2(1+Kn)}\overline x^2\right)\\ =& \frac{1}{(2\pi)^{\frac{n}{2}}} \exp\left(-\frac{1}{2} \sum _{i=1}^nx_i^2\right) (nK+1)^{-1/2}\exp\left(\frac{Kn^2}{2(1+Kn)}\overline x^2\right) \end{align} \] so the Bayes factor is \[ B_{21}=(n K+1)^{-1 / 2} \exp \left(\frac{K n^{2}}{2(1+K n)} \bar{x}^{2}\right) \]

For large \(K\) (or large \(n\)), this is roughly \((nK)^{-1/2}\exp(z^2/2)\), where \(z=\sqrt n\, \overline x\). So \(B_{21}\) depends very strongly on \(K\), which is chosen arbitrarily.

In contrast, the usual noninformative prior for \(\theta\) in this situation is \(\pi^N_2(\theta)=1\). The resulting Bayes factor is \(B_{21}=\sqrt{n / 2 \pi} \exp \left(z^{2} / 2\right)\), which is a reasonable value. The lesson is to never use ‘arbitrary’ vague proper priors for model selection, whereas improper noninformative priors may give reasonable results.
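A small Python sketch (not in the paper) makes the sensitivity to \(K\) visible: it evaluates \(B_{21}\) under the vague proper \(\mathcal N(0,K)\) prior for several values of \(K\) and compares with the noninformative-prior value \(\sqrt{n/2\pi}\exp(z^2/2)\); the sample size and sample mean below are hypothetical.

```python
import numpy as np

def B21_vague(xbar, n, K):
    """Bayes factor of M2 (theta ~ N(0, K)) to M1 (theta = 0) for N(theta, 1) data."""
    return (n * K + 1) ** -0.5 * np.exp(K * n**2 * xbar**2 / (2 * (1 + K * n)))

def B21_noninformative(xbar, n):
    """Bayes factor using the improper prior pi(theta) = 1."""
    z = np.sqrt(n) * xbar
    return np.sqrt(n / (2 * np.pi)) * np.exp(z**2 / 2)

n, xbar = 20, 0.3          # hypothetical sample size and sample mean
for K in [1, 10, 100, 1e4, 1e6]:
    print(f"K = {K:>9g}: B21 = {B21_vague(xbar, n, K):.4g}")
print("improper prior :", f"{B21_noninformative(xbar, n):.4g}")
```

As \(K\) grows, the vague-prior Bayes factor drifts toward zero regardless of the data, while the noninformative-prior value stays fixed.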

However, this conflicts with the principle identified in Difficulty 2: improper priors carry arbitrary normalizing constants (scales), so in principle proper priors are needed for the Bayes factor to be well defined.

  • Difficulty 4. Even ‘common parameters’ can change meaning from one model to another, so that prior distributions must change in a corresponding fashion. Here is an example.

In practice this difficulty arises mainly because the design matrix is usually not orthogonal: some explanatory variables are correlated, so a coefficient can have a different meaning and a different scale from one model to another.

For example, consider the models \[ \begin{aligned} M_{1}: Y=X_{1} \beta_{1}+\varepsilon_{1}, & \quad \varepsilon_{1} \sim \mathcal{N}\left(0, \sigma_{1}^{2}\right) \\ M_{2}: Y=X_{1} \beta_{1}+X_{2} \beta_{2}+\varepsilon_{2}, & \quad \varepsilon_{2} \sim \mathcal{N}\left(0, \sigma_{2}^{2}\right) \end{aligned} \] where \(Y\) is gasoline consumption, \(X_1\) is vehicle weight, and \(X_2\) is engine size. The prior density under \(M_2\) is of the form \(\pi_2(\beta_1,\beta_2,\sigma_2)=\pi_{21}(\beta_1)\pi_{22}(\beta_2)\pi_{23}(\sigma_2)\). Since \(\beta_1\) is ‘common’ to the two models, one frequently sees the same prior, \(\pi_{21}(\beta_1)\), also used for this parameter in \(M_1\).

However, this is not reasonable, since \(\beta_1\) has a very different meaning (and value) under \(M_1\) than under \(M_2\), because of the considerable positive correlation between weight and engine size. It is therefore unreasonable to assign the same prior to \(\beta_1\) in both models.

Even when the design matrix is orthogonal, the error variances can still cause trouble: for \(\sigma_1^2\) and \(\sigma_2^2\), one often sees the variances being equated and assigned the same prior, even though \(\sigma^2_1\) will typically be larger than \(\sigma^2_2\).

19.1.6 Preview

There are several methods for developing default Bayesian model selection procedures: the Conventional Prior (CP) approach, the Bayesian Information Criterion (BIC), the Intrinsic Bayes Factor (IBF) approach, and the Fractional Bayes Factor (FBF) approach.

The Conventional Prior approach is the most obvious default Bayesian method that can be employed for model selection, and it fares well against standard criteria for Bayesian variable selection.

However, the Conventional Prior approach is not always easy to implement in practice, so cruder approximations to Bayes factors, typified by the Bayesian Information Criterion (BIC), are used instead.

Concerns over the accuracy and applicability of BIC have prompted new methods of Bayesian model selection. The two most prominent are the Fractional Bayes Factor approach of O’Hagan (1995) and the Intrinsic Bayes Factor approach of Berger and Pericchi (1996).

19.2 Objective Bayesian Model Selection Methods, with Illustrations in the Linear Model

Four methods are introduced and discussed: the Conventional Prior (CP) approach, the Bayesian Information Criterion (BIC), the Intrinsic Bayes Factor (IBF), and the Fractional Bayes Factor (FBF).

19.2.1 Conventional Prior Approach

In general, if the same improper prior is used under both models, the normalizing constant cancels in the Bayes factor and causes no problem. This device is appropriate only for parameters that are common to the models and orthogonal to the remaining parameters: being common, they appear in both the numerator and the denominator, and orthogonality means Difficulty 4 does not arise (or at least causes only a minor problem). For parameters that occur in one model but not the other, the alternative is to use default proper priors (but not vague proper priors).

The question, then, is: if vague proper priors are ruled out, which priors should be used?

Jeffreys presented arguments justifying certain default proper priors, mostly on a case-by-case basis. This line of development has been followed successfully by many others (Zellner and Siow 1980, Berger and Pericchi 1996, among others).

Illustration 1 Normal Mean, Jeffreys’ Conventional Prior

The data are \(\mathbf{X}=\left(X_{1}, \dots, X_{n}\right)\), where the \(X_i\) are iid \(\mathcal{N}\left(\mu, \sigma_{2}^{2}\right)\) under \(M_2\) and iid \(\mathcal N(0,\sigma_1^2)\) under \(M_1\). Note that, because of Difficulty 4, we distinguish between \(\sigma_1^2\) and \(\sigma_2^2\). However, in this situation the mean and variance can be shown to be orthogonal parameters (i.e. the expected Fisher information matrix is diagonal), in which case Jeffreys argues that \(\sigma_1^2\) and \(\sigma_2^2\) do have the same meaning across models and can be identified as \(\sigma_1^2=\sigma_2^2=\sigma^2\). In this case, Jeffreys suggests that the variances can be assigned the same (improper) noninformative prior \(\pi^J(\sigma)=1/\sigma\), since the indeterminate multiplicative constant for the prior cancels in the Bayes factor.

This cannot be done for \(\mu\), which is a parameter of \(M_2\) only; it must be assigned a proper prior. Jeffreys arrives at the following desiderata that this proper prior should satisfy: 1. it should be centered at zero (i.e. centered at \(M_1\)); 2. it should have scale \(\sigma\); 3. it should be symmetric around zero; 4. it should have no moments.

The “no moments” requirement seems strange; why is it imposed?

He argues that the simplest distribution satisfying these conditions is the Cauchy\((0,\sigma^2)\). In summary, Jeffreys’s conventional prior for this problem is: \[ \pi_{1}^{J}\left(\sigma_{1}\right)=\frac{1}{\sigma_{1}}, \quad \pi_{2}^{J}\left(\mu, \sigma_{2}\right)=\frac{1}{\sigma_{2}} \cdot \frac{1}{\pi \sigma_{2}\left(1+\mu^{2} / \sigma_{2}^{2}\right)} \]
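As a rough check of what this conventional prior does in practice, the following Python sketch (not from the paper) computes \(B_{21}\) under Jeffreys’s priors by direct numerical integration; the data are simulated, and the finite integration limits are a numerical convenience only, not part of Jeffreys’s construction.

```python
import numpy as np
from scipy.integrate import quad, dblquad

rng = np.random.default_rng(0)
x = rng.normal(0.4, 1.0, size=10)            # hypothetical data
n, xbar = len(x), x.mean()

def lik(mu, sigma):
    """Normal likelihood of the whole sample at mean mu, standard deviation sigma."""
    return (2 * np.pi * sigma**2) ** (-n / 2) * np.exp(-np.sum((x - mu) ** 2) / (2 * sigma**2))

# m_1(x): integrate lik(0, sigma) * pi_1^J(sigma) over sigma, with pi_1^J(sigma) = 1/sigma
m1, _ = quad(lambda s: lik(0.0, s) / s, 1e-3, 50, epsabs=1e-13)

# m_2(x): integrate lik(mu, sigma) * Cauchy(mu; 0, sigma) * (1/sigma) over mu and sigma
def integrand(mu, sigma):
    cauchy = 1.0 / (np.pi * sigma * (1.0 + (mu / sigma) ** 2))
    return lik(mu, sigma) * cauchy / sigma

# finite limits wide enough for this sample (numerical convenience only)
m2, _ = dblquad(integrand, 1e-3, 50, lambda s: xbar - 20.0, lambda s: xbar + 20.0, epsabs=1e-13)

print("Jeffreys conventional-prior Bayes factor B21 =", m2 / m1)
```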

A nice sentence from the paper: “Although this solution appears to be rather ad hoc, it is quite reasonable.”

(“Ad hoc” here can mean improvised, or devised for a particular purpose.)

Choosing the scale of the prior for \(\mu\) to be \(\sigma_2\) (the only available non-subjective ‘scaling’ in the problem) and centering it at \(M_1\) are natural choices, and Cauchy priors are known to be robust in various ways. Although it is easy to object to having such choices imposed on the analysis, it is crucial to keep in mind that there is no real default Bayesian alternative here. This passage is not explained further in the paper but is important, so it is recorded in full. One speaks of the scale of the prior for \(\mu\) rather than its variance, since the Cauchy has no moments; these choices may look subjective (even though they are natural for the purpose at hand), but the crucial point is that there is no real default Bayesian alternative: competing ‘objective’ methods either correspond to using some other (proper) default prior or, worse, end up corresponding to no actual Bayesian analysis at all.

Illustration 2: Normal Linear Model, Zellner–Siow Conventional Priors

In Zellner and Siow (1980), a generalization of the above conventional Jeffreys prior is suggested for comparing two nested models within the normal linear model. Let \(\mathbf X=[\mathbf 1:\mathbf Z_1:\mathbf Z_2]\) be the design matrix for the ‘full’ linear model. Assume that the regressors are measured in terms of deviations from their sample means, that is, \(\mathbf{1}^{t} \mathbf{Z}_{j}=0, j=1,2\), and that the model has been parameterized in an orthogonal fashion, so that \(\mathbf{Z}_{1}^{t} \mathbf{Z}_{2}=0\).

Then the corresponding normal linear model \(M_2\) for \(n\) observations \(\mathbf y=\left(y_{1}, \dots, y_{n}\right)^{t}\) is \[ \mathbf{y}=\alpha \mathbf{1}+\mathbf{Z}_{1} \boldsymbol{\beta}_{1}+\mathbf{Z}_{2} \boldsymbol{\beta}_{2}+\boldsymbol{\varepsilon} \] where \(\boldsymbol\varepsilon \sim \mathcal N_n(\mathbf 0,\sigma^2\mathbf I_n)\). The dimensions of \(\boldsymbol\beta_1\) and \(\boldsymbol\beta_2\) are \(k_1-1\) and \(p\), respectively.
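The centered, orthogonal parameterization assumed above can be produced mechanically. The following numpy sketch (with hypothetical covariates, not from the paper) centers the regressors so that \(\mathbf 1^t\mathbf Z_j=0\) and projects \(\mathbf Z_2\) onto the orthogonal complement of \(\mathbf Z_1\) so that \(\mathbf Z_1^t\mathbf Z_2=0\).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
Z1_raw = rng.normal(size=(n, 2))        # hypothetical covariates kept under M1
Z2_raw = rng.normal(size=(n, 3))        # hypothetical extra covariates tested in M2

# center every regressor so that 1' Z_j = 0
Z1 = Z1_raw - Z1_raw.mean(axis=0)
Z2 = Z2_raw - Z2_raw.mean(axis=0)

# orthogonalize Z2 against Z1 (project out the Z1 column space) so that Z1' Z2 = 0
proj = Z1 @ np.linalg.solve(Z1.T @ Z1, Z1.T @ Z2)
Z2 = Z2 - proj

print(np.allclose(Z1.T @ Z2, 0))        # True: the parameterization is orthogonal
X = np.column_stack([np.ones(n), Z1, Z2])
```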

For comparison of \(M_2\) with the model \(M_1:\boldsymbol\beta_2=\mathbf 0\), Zellner and Siow (1980) propose the following default conventional priors: \[ \begin{aligned} \pi_{1}^{Z S}\left(\alpha, \boldsymbol\beta_{1}, \sigma\right) &=1 / \sigma \\ \pi_{2}^{Z S}\left(\alpha, \boldsymbol\beta_{1}, \sigma, \boldsymbol\beta_{2}\right) &=h\left(\boldsymbol\beta_{2} \mid \sigma\right) / \sigma \end{aligned} \] The first is an improper prior; the second combines the same improper prior with a proper conditional prior for \(\boldsymbol\beta_2\). Here \(h(\boldsymbol\beta_2\mid\sigma)\) is the \(\text { Cauchy }_{p}\left(\mathbf{0}, \mathbf{Z}_{2}^{t} \mathbf{Z}_{2} /\left(n \sigma^{2}\right)\right)\) density \[ h\left(\boldsymbol{\beta}_{2} \mid \sigma\right)=c \frac{\left|\mathbf{Z}_{2}^{t} \mathbf{Z}_{2}\right|^{1 / 2}}{\left(n \sigma^{2}\right)^{p / 2}}\left(1+\frac{\boldsymbol{\beta}_{2}^{t} \mathbf{Z}_{2}^{t} \mathbf{Z}_{2} \boldsymbol{\beta}_{2}}{n \sigma^{2}}\right)^{-(p+1) / 2} \] with \(c=\Gamma[(p+1) / 2] / \pi^{(p+1) / 2}\). Thus the improper priors on the “common” parameters \((\alpha,\boldsymbol\beta_1,\sigma)\) are assumed to be the same for the two models (justifiable by orthogonality), while the conditional prior of the (unique to \(M_2\)) parameter \(\boldsymbol\beta_2\), given \(\sigma\), is the proper \(p\)-dimensional Cauchy distribution with location \(\mathbf 0\) and scale determined by \(\mathbf{Z}_{2}^{t} \mathbf{Z}_{2} /\left(n \sigma^{2}\right)\), “a matrix suggested by the form of the information matrix” (Zellner and Siow 1980).

The nested case can be handled by the method above. When there are more than two models, or the models are non-nested, Zellner and Siow (1984) utilize what is often called the ‘encompassing’ approach, wherein one compares each submodel \(M_i\) to the encompassing model \(M_0\) that contains all possible covariates from the submodels, yielding pairwise Bayes factors \(B_{0i}, i=1,\ldots,q\). The Bayes factor of \(M_j\) to \(M_i\) is then defined to be \[ B_{j i}=B_{0 i} / B_{0 j} \] In short, the ‘encompassing’ approach takes the largest model containing all candidate models as the baseline (full) model and compares every candidate against it to obtain the Bayes factors.
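A tiny Python helper (an illustration, not code from the paper) shows how encompassing Bayes factors \(B_{0i}\) are turned into pairwise factors \(B_{ji}=B_{0i}/B_{0j}\); the model names and log Bayes factor values are hypothetical.

```python
import numpy as np

def pairwise_from_encompassing(log_B0: dict) -> dict:
    """Pairwise Bayes factors B_ji = B_0i / B_0j from encompassing factors B_0i.

    log_B0 : {model_name: log B_0i}, the log Bayes factor of the
             encompassing model M_0 against each submodel M_i.
    """
    names = list(log_B0)
    return {(j, i): np.exp(log_B0[i] - log_B0[j]) for j in names for i in names if i != j}

# hypothetical log B_0i values for three submodels
log_B0 = {"M1": 2.1, "M2": 0.4, "M3": 3.7}
B = pairwise_from_encompassing(log_B0)
print(B[("M2", "M1")])   # B_21 = B_01 / B_02
```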

Illustration 2 (continued): Linear Model, Conjugate \(g\)-Priors

Consider a similar model setting: \[ M: \mathbf{y}=\mathbf{X} \boldsymbol{\beta}+\boldsymbol{\varepsilon}, \quad \boldsymbol{\varepsilon} \sim \mathcal{N}_{n}\left(\mathbf{0}, \sigma^{2} \mathbf{I}_{n}\right) \] The \(g\)-prior density is defined by \[ \pi(\sigma)=\frac{1}{\sigma}, \quad \pi(\boldsymbol\beta \mid \sigma) \text { is } \mathcal{N}_{k}\left(\mathbf{0}, g \sigma^{2}\left(\mathbf{X}^{t} \mathbf{X}\right)^{-1}\right) \] Sometimes \(g\) is estimated by empirical Bayes methods.

The key property of \(g\)-priors is that the marginal density \(m(\mathbf y)\) has a closed form: \[ m(\mathbf{y})=\frac{\Gamma(n / 2)}{2 \pi^{n / 2}(1+g)^{k / 2}}\left(\mathbf{y}^{t} \mathbf{y}-\frac{g}{(1+g)} \mathbf{y}^{t} \mathbf{X}\left(\mathbf{X}^{t} \mathbf{X}\right)^{-1} \mathbf{X}^{t} \mathbf{y}\right)^{-n / 2} \] Bayes factors and posterior model probabilities for comparing any two linear models can therefore be computed directly.
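The closed form makes implementation trivial. Here is a Python sketch (not from the paper) that evaluates \(\log m(\mathbf y)\) under the \(g\)-prior and uses it to compute a Bayes factor between a full model and a submodel on simulated data; the choice \(g=n\) and the simulated design are assumptions for illustration only, and the same \(\pi(\sigma)=1/\sigma\) prior is used in both models so that its constant cancels.

```python
import numpy as np
from scipy.special import gammaln

def log_marginal_gprior(y, X, g):
    """log m(y) under the g-prior: pi(sigma) = 1/sigma, beta | sigma ~ N_k(0, g sigma^2 (X'X)^{-1})."""
    n, k = X.shape
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    yPy = y @ (X @ beta_hat)                      # y' X (X'X)^{-1} X' y
    quad_form = y @ y - g / (1.0 + g) * yPy
    return (gammaln(n / 2) - np.log(2) - (n / 2) * np.log(np.pi)
            - (k / 2) * np.log(1.0 + g) - (n / 2) * np.log(quad_form))

# hypothetical simulated data: compare the full model against a submodel
rng = np.random.default_rng(2)
n = 60
X_full = rng.normal(size=(n, 4))
y = X_full[:, :2] @ np.array([1.0, -0.5]) + rng.normal(size=n)
X_sub = X_full[:, :2]

g = float(n)                                      # a common default choice, g = n
log_B_full_sub = log_marginal_gprior(y, X_full, g) - log_marginal_gprior(y, X_sub, g)
print("B(full vs sub) =", np.exp(log_B_full_sub))
```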

Unfortunately, \(g\)-priors have some undesirable properties when used for model selection. Consider the model \(M^*:\boldsymbol\beta=\mathbf 0\). As \(\hat{\boldsymbol\beta}\) goes to infinity, so that one becomes certain that \(M^*\) is wrong, the Bayes factor of \(M^*\) to \(M\) goes not to zero but to the nonzero constant \((1+g)^{(k-n) / 2}\).

Although the conventional prior approach is conceptually appealing, the discussion above shows that it is difficult to implement, even for models as simple as the linear model; there is no universal recipe for determining conventional priors. The following sections therefore describe methods that are easier to apply.

19.2.2 Intrinsic Bayes Factor (IBF) Approach

Suppose there are \(q\) models \(M_1,\ldots,M_q\) with noninformative priors \(\pi_i^N(\theta_i),i=1,\ldots,q\), typically chosen to be the ‘reference priors’ of Berger and Bernardo (1992).

What is the reference prior in this case?

Define the corresponding marginal or predictive densities of \(\mathbf X\), \[ m_{i}^{N}(\mathbf{x})=\int f_{i}\left(\mathbf{x} | \theta_{i}\right) \pi_{i}^{N}\left(\theta_{i}\right) d \theta_{i} \]

The general strategy for defining an IBF starts with the definition of a proper and minimal ‘training sample’, viewed as some subset of the entire data \(\mathbf x\).

Definition. A training sample, \(\mathbf x(\ell)\), is called proper if \(0<m_{i}^{N}(\mathbf{x}(\ell))<\infty\) for all \(M_i\), and minimal if it is proper and no subset is proper.

The idea, then, is first to use a minimal amount of data to turn the improper prior \(\pi_i^N\) into a proper posterior \(\pi_i^N(\theta_i\mid \mathbf x(\ell))\) via the likelihood of the training sample, and then to use that posterior to define Bayes factors for the remaining data. The resulting Bayes factor for comparing \(M_j\) to \(M_i\) can be seen (under most circumstances) to be \[ B_{j i}(\ell)=B_{j i}^{N}(\mathbf{x}) \cdot B_{i j}^{N}(\mathbf{x}(\ell)) \] where \[ B_{j i}^{N}=B_{j i}^{N}(\mathbf{x})=\frac{m_{j}^{N}(\mathbf{x})}{m_{i}^{N}(\mathbf{x})} \quad \text { and } \quad B_{i j}^{N}(\ell)=B_{i j}^{N}(\mathbf{x}(\ell))=\frac{m_{i}^{N}(\mathbf{x}(\ell))}{m_{j}^{N}(\mathbf{x}(\ell))} \] The reasoning behind this is as follows.

The usual Bayes factor is \[ B_{i0}=\frac{m_i(\mathbf y)}{m_0(\mathbf y)}, \qquad m_i(\mathbf y)=\int \pi_i(\theta_i)f(\mathbf y\mid\theta_i)\,d\theta_i, \qquad m_0(\mathbf y)=\int \pi_0(\theta_0)f(\mathbf y\mid\theta_0)\,d\theta_0 \] Updating the improper prior \(\pi_i(\theta_i)\) with the training sample \(\mathbf x(\ell)\) gives the posterior \[ \pi_i(\theta_i\mid \mathbf x(\ell))=\frac{f(\mathbf x(\ell)\mid\theta_i)\pi_i(\theta_i)}{\int f(\mathbf x(\ell)\mid\theta_i)\pi_i(\theta_i)\,d\theta_i} \] whose denominator is \(m_i^N(\mathbf x(\ell))\). Then \[ \begin{align*} B_{i0}(\ell) &=\frac{\int\pi^N_i(\theta_i\mid \mathbf x(\ell))f_i(\mathbf x(-\ell)\mid\theta_i)\,d\theta_i}{\int\pi^N_0(\theta_0\mid \mathbf x(\ell))f_0(\mathbf x(-\ell)\mid\theta_0)\,d\theta_0}\\ &=\frac{\int\frac{\pi^N_i(\theta_i)f_i(\mathbf x(\ell)\mid\theta_i)}{m^N_i(\mathbf x(\ell))}f_i(\mathbf x(-\ell)\mid\theta_i)\,d\theta_i}{\int\frac{\pi^N_0(\theta_0)f_0(\mathbf x(\ell)\mid\theta_0)}{m^N_0(\mathbf x(\ell))}f_0(\mathbf x(-\ell)\mid\theta_0)\,d\theta_0}\\ &=\frac{m_i^N(\mathbf x)/m_i^N(\mathbf x(\ell))}{m_0^N(\mathbf x)/m_0^N(\mathbf x(\ell))}\\ &=\frac{m^N_i(\mathbf x)}{m_0^N(\mathbf x)}\cdot\frac{m_0^N(\mathbf x(\ell))}{m_i^N(\mathbf x(\ell))}\\ &=B_{i0}^N(\mathbf x)\cdot B_{0i}^N(\mathbf x(\ell)) \end{align*} \]

While \(B_{ji}(\ell)\) no longer depends on the arbitrary scales of \(\pi^N_j\) and \(\pi^N_i\), it does depend on the arbitrary choice of the (minimal) training sample \(\mathbf x(\ell)\). To eliminate this dependence and to increase stability, the correction factor can be averaged over training samples in several ways:

\[ B_{j i}^{A I}=B_{j i}^{N} \cdot \frac{1}{L} \sum_{\ell=1}^{L} B_{i j}^{N}(\ell), \qquad B_{j i}^{M I}=B_{j i}^{N} \cdot \operatorname{Med}\left[B_{i j}^{N}(\ell)\right] \] That is, the arithmetic mean or the median of the correction factors \(B_{ij}^{N}(\mathbf x(\ell))\) over the \(L\) minimal training samples is used; these define the arithmetic intrinsic Bayes factor (AIBF) and the median intrinsic Bayes factor (MIBF), respectively.

For the AIBF, it is typically necessary to place the more “complex” model in the numerator; for example, if \(M_j\) is the more complex model, then \(B_{ij}^{AI}\) is defined by \(B_{ij}^{A I}=1 / B_{j i}^{AI}\). The IBFs can then be inserted, in place of the \(B_{ji}\), into \[ P\left(M_{i} \mid \mathbf{x}\right)=\frac{P\left(M_{i}\right) m_{i}(\mathbf{x})}{\sum_{j=1}^{q} P\left(M_{j}\right) m_{j}(\mathbf{x})}=\left[\sum_{j=1}^{q} \frac{P\left(M_{j}\right)}{P\left(M_{i}\right)} B_{j i}\right]^{-1} \] to obtain posterior probabilities of the models \(M_i\). Because the averages involve reuse of the sample, IBFs can be viewed as resampling summaries of the evidence in the data for comparing the models.

Illustration 1 (continued) Normal Mean, AIBF and MIBF

Start with the noninformative priors \(\pi_{1}^{N}\left(\sigma_{1}\right)=1 / \sigma_{1}\) and \(\pi_{2}^{N}\left(\mu, \sigma_{2}\right)=1 / \sigma_{2}^{2}\). Note that \(\pi_2^N\) is not the reference prior.

What is a reference prior? See “The Formal Definition of Reference Priors” by Berger, Bernardo and Sun.

For a minimal training sample \(\mathbf x(\ell)=(x_i,x_j)\), \[ m_{1}^{N}(\mathbf{x}(\ell))=\frac{1}{2 \pi\left(x_{i}^{2}+x_{j}^{2}\right)}, \quad m_{2}^{N}(\mathbf{x}(\ell))=\frac{1}{\sqrt{\pi}\left(x_{i}-x_{j}\right)^{2}} \] Computation yields the Bayes factor for the full data \(\mathbf x\), when using \(\pi_1^N\) and \(\pi_2^N\) directly as the priors: \[ B_{21}^{N}=\sqrt{\frac{2 \pi}{n}} \cdot\left(1+\frac{n \bar{x}^{2}}{s^{2}}\right)^{n / 2} \] where \(s^{2}=\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}\). The AIBF is then clearly equal to \[ B_{21}^{A I}=B_{21}^{N} \cdot \frac{1}{L} \sum_{\ell=1}^{L} \frac{\left(x_{1}(\ell)-x_{2}(\ell)\right)^{2}}{2 \sqrt{\pi}\left[x_{1}^{2}(\ell)+x_{2}^{2}(\ell)\right]} \] and the MIBF is given by \[ B_{21}^{M I}=B_{21}^{N} \cdot \operatorname{Med}_{\ell=1, \ldots, L}\left[\frac{\left(x_{1}(\ell)-x_{2}(\ell)\right)^{2}}{2 \sqrt{\pi}\left[x_{1}^{2}(\ell)+x_{2}^{2}(\ell)\right]}\right] \] The calculation of \(m_1^N(\mathbf x(\ell))\), for a training sample \((x_1,x_2)\), is \[ \begin{align} &\int \pi_1^N(\sigma_1) f_1(\mathbf x(\ell)\mid\sigma_1)\,d\sigma_1\\ =&\int \frac{1}{\sigma_1} \frac{1}{2\pi\sigma_1^2}\exp\left(-\frac{1}{2\sigma_1^2}(x_1^2+x_2^2)\right)d\sigma_1\\ =&\int \frac{1}{2\pi} \sigma_1^{-3}\exp\left(-\frac{1}{2}\sigma_1^{-2}(x_1^2+x_2^2)\right)d\sigma_1\\ =&\frac{1}{2\pi}\int_0^\infty \frac{1}{2}\exp\left(-\frac{x_1^2+x_2^2}{2}t\right)dt \qquad (t=\sigma_1^{-2})\\ =&\frac{1}{2\pi(x_1^2+x_2^2)} \end{align} \] and \(m_2^N(\mathbf x(\ell))\) follows from a similar calculation.
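For this illustration the AIBF and MIBF are easy to compute directly. The following Python sketch (not from the paper) evaluates \(B_{21}^N\) and averages the correction factors \(B_{12}^N(\ell)\) over all minimal training samples (pairs of observations) for hypothetical simulated data.

```python
import numpy as np
from itertools import combinations

def aibf_mibf_normal_mean(x):
    """AIBF and MIBF of M2: N(mu, sigma^2) versus M1: N(0, sigma^2), using the priors
    pi_1(sigma) = 1/sigma and pi_2(mu, sigma) = 1/sigma^2 as in the illustration."""
    x = np.asarray(x, float)
    n, xbar = len(x), x.mean()
    s2 = np.sum((x - xbar) ** 2)

    # full-data Bayes factor under the noninformative priors
    B21N = np.sqrt(2 * np.pi / n) * (1 + n * xbar**2 / s2) ** (n / 2)

    # correction factors B_12^N(l) = m_1^N(x(l)) / m_2^N(x(l)) over all minimal
    # training samples x(l) = (x_i, x_j), i < j
    corr = np.array([(xi - xj) ** 2 / (2 * np.sqrt(np.pi) * (xi**2 + xj**2))
                     for xi, xj in combinations(x, 2)])

    return B21N * corr.mean(), B21N * np.median(corr)

rng = np.random.default_rng(3)
x = rng.normal(0.5, 1.0, size=15)     # hypothetical data
print(aibf_mibf_normal_mean(x))
```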

19.2.3 The Fractional Bayes Factor (FBF) Approach

The treatment of this method in the present paper was not immediately clear to me, so I turned to O’Hagan (1995); to avoid forgetting it before writing up the full summary, that material is recorded directly here.

This method developed out of the Intrinsic Bayes Factor approach described in the previous section. For the proper prior constructed from a minimal training sample to behave well, one should in principle range over all training samples \(\mathbf x(\ell)\), for example by averaging or taking medians. To simplify this process, let \(b=m/n\), where \(m\) is the training-sample size and \(n\) the full sample size. If both \(m\) and \(n\) are large, the likelihood based only on the training sample \(\mathbf y\) will approximate the full likelihood \(f_i(\mathbf x\mid\theta_i)\) raised to the power \(b\). This leads to \[ B_{b}(\mathbf{x})=q_{1}(b, \mathbf{x}) / q_{2}(b, \mathbf{x}), \qquad q_{i}(b, \mathbf{x})=\frac{\int \pi_{i}\left(\theta_{i}\right) f_{i}\left(\mathbf{x} \mid \theta_{i}\right) \mathrm{d} \theta_{i}}{\int \pi_{i}\left(\theta_{i}\right) f_{i}\left(\mathbf{x} \mid \theta_{i}\right)^{b} \mathrm{d} \theta_{i}} \] If the improper priors have the form \(\pi_{i}\left(\boldsymbol{\theta}_{i}\right)=c_{i} h_{i}\left(\boldsymbol{\theta}_{i}\right)\), the indeterminate constant \(c_i\) cancels out, leaving \[ q_{i}(b, \mathbf{x})=\frac{\int h_{i}\left(\boldsymbol{\theta}_{i}\right) f_{i}\left(\mathbf{x} \mid \boldsymbol{\theta}_{i}\right) \mathrm{d} \boldsymbol{\theta}_{i}}{\int h_{i}\left(\boldsymbol{\theta}_{i}\right) f_{i}\left(\mathbf{x} \mid \boldsymbol{\theta}_{i}\right)^{b} \mathrm{d} \boldsymbol{\theta}_{i}} \]

So the superscript \(b\) is meant literally: the likelihood is raised to the power \(b\); it is not merely a label.
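As a concrete (hypothetical) instance, the following Python sketch computes the FBF for the normal-mean comparison \(M_1:\theta=0\) versus \(M_2:\theta\) free, with known unit variance, using the flat kernel \(h_2(\theta)=1\) and fraction \(b=m/n\); the integrals in \(q_i(b,\mathbf x)\) are evaluated by quadrature on the log scale to avoid underflow.

```python
import numpy as np
from scipy.integrate import quad

def log_lik(theta, x):
    """Log-likelihood of i.i.d. N(theta, 1) data."""
    return -0.5 * len(x) * np.log(2 * np.pi) - 0.5 * np.sum((x - theta) ** 2)

def log_int_powered_lik(x, power):
    """log of  int f(x | theta)^power d theta  with the flat improper kernel h(theta) = 1
    (any constant multiple cancels), computed by quadrature after rescaling."""
    xbar, n = x.mean(), len(x)
    peak = power * log_lik(xbar, x)                     # maximum of the integrand's log
    sd = 1.0 / np.sqrt(n * power)                       # width of the powered likelihood
    val, _ = quad(lambda t: np.exp(power * log_lik(t, x) - peak),
                  xbar - 10 * sd, xbar + 10 * sd)
    return peak + np.log(val)

def fractional_bayes_factor(x, m=1):
    """FBF of M2: N(theta, 1), theta free, versus M1: theta = 0, with b = m/n."""
    x = np.asarray(x, float)
    b = m / len(x)
    log_q2 = log_int_powered_lik(x, 1.0) - log_int_powered_lik(x, b)
    log_q1 = (1.0 - b) * log_lik(0.0, x)                # M1 has no free parameters
    return np.exp(log_q2 - log_q1)

rng = np.random.default_rng(4)
x = rng.normal(0.4, 1.0, size=30)                       # hypothetical data
print("FBF B21 =", fractional_bayes_factor(x))
```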