Paper 18 Criteria for Bayesian Model Choice With Application to Variable Selection

18.1 Abstract:

In objective Bayesian model selection, no single criterion has emerged as dominant in determining the prior distributions.

18.2 Introduction

18.2.1 Background

“Objective model selection priors”: priors that are not subjectively elicited.

There has been no agreement as to which are most appealing or most successful.

18.2.2 Notation.

Let \(\mathbf{y}\) be a data vector of size \(n\). The models under consideration are \[ M_{0} : f_{0}(\mathbf{y} | \boldsymbol{\alpha}), \quad M_{i} : f_{i}\left(\mathbf{y} | \boldsymbol{\alpha}, \boldsymbol{\beta}_{i}\right), \quad i=1,2, \ldots, N-1 \]

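\(\alpha\) is the intercept (more generally, the parameters common to all models), and \(\beta_i\) is the \(k_i\)-dimensional vector of model-specific parameters. Under the null model \(M_0\), the prior is \(\pi_0(\alpha)\).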

Model selection prior: \[ \pi_{i}\left(\boldsymbol{\alpha}, \boldsymbol{\beta}_{i}\right)=\pi_{i}(\boldsymbol{\alpha}) \pi_{i}\left(\boldsymbol{\beta}_{i} | \boldsymbol{\alpha}\right) \] The posterior probability of \(M_i\) can then be written as \[ \operatorname{Pr}\left(M_{i} | \mathbf{y}\right)=\frac{P_{i 0} B_{i 0}}{1+\left(\sum_{j=1}^{N-1} B_{j 0} P_{j 0}\right)} \] where \(P_{j0}\) is the prior odds (prior ratio) \(P_{j 0}=\operatorname{Pr}\left(M_{j}\right) / \operatorname{Pr}\left(M_{0}\right)\), that is, the prior probability of \(M_j\) divided by that of the null model. With equal prior model probabilities, \(P_{i0}=1\) and the numerator is simply \(B_{i0}\). \(B_{j0}\) is the Bayes factor of model \(M_j\) to \(M_0\), defined as \[ B_{j 0}=\frac{m_{j}(\mathbf{y})}{m_{0}(\mathbf{y})} \quad \text { with } m_{j}(\mathbf{y})=\int f_{j}\left(\mathbf{y} | \boldsymbol{\alpha}, \boldsymbol{\beta}_{j}\right) \pi_{j}\left(\boldsymbol{\alpha}, \boldsymbol{\beta}_{j}\right) d \boldsymbol{\alpha} d \boldsymbol{\beta}_{j} \]

and \(m_{0}(\mathbf{y})=\int f_{0}(\mathbf{y} | \boldsymbol{\alpha}) \pi_{0}(\boldsymbol{\alpha}) d \boldsymbol{\alpha}\). Here \(m_j(\mathbf{y})\) and \(m_0(\mathbf{y})\) are the marginal likelihoods of \(M_j\) and \(M_0\), and the corresponding model priors are \(\pi_{j}\left(\boldsymbol{\alpha}, \boldsymbol{\beta}_{j}\right)\) and \(\pi_{0}(\boldsymbol{\alpha})\). The paper is mainly concerned with how to choose \(\pi_{j}\left(\boldsymbol{\alpha}, \boldsymbol{\beta}_{j}\right)\) and \(\pi_{0}(\boldsymbol{\alpha})\).
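The step from Bayes factors and prior odds to posterior model probabilities can be sketched in a few lines. This is only an illustration: the function name `posterior_model_probs` and the example Bayes factors are made up, and the code assumes the standard normalisation in which the null model gets weight 1 and \(M_j\) gets weight \(B_{j0}P_{j0}\).

```python
def posterior_model_probs(bayes_factors, prior_odds=None):
    """Posterior probabilities of M_0, M_1, ..., M_{N-1}.

    bayes_factors[j-1] is B_{j0} and prior_odds[j-1] is P_{j0}, for j = 1..N-1.
    """
    if prior_odds is None:
        # Assumption for the illustration: equal prior model probabilities.
        prior_odds = [1.0] * len(bayes_factors)
    # Unnormalised weights: 1 for the null model, B_{j0} * P_{j0} for M_j.
    weights = [1.0] + [b * p for b, p in zip(bayes_factors, prior_odds)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical Bayes factors B_10 = 3 and B_20 = 0.5.
probs = posterior_model_probs([3.0, 0.5])
print(probs)
```

The first entry of the returned list is \(\operatorname{Pr}(M_0 | \mathbf{y})\); the remaining entries follow the order of the Bayes factors passed in.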

18.3 Criteria for objective model selection priors.

Jeffreys’s desiderata: precursors to the criteria developed herein.

The paper groups the criteria for priors into four classes: basic, consistency criteria, predictive matching criteria and invariance criteria.

18.3.1 Basic criteria:

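In general, the prior for \(\beta_i\) should be proper, because these parameters appear only in the numerator of the Bayes factor \(B_{i0}\). If an improper prior were used, its arbitrary constant would not cancel, so \(B_{i0}\) would not be well defined.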

For the same reason, vague proper priors cannot be used in \(B_{i0}\): the arbitrary degree of vagueness enters the Bayes factor multiplicatively, making the Bayes factor itself arbitrary. This leads to the following criterion. Criterion 1 (Basic): Each conditional prior \(\pi_{i}\left(\boldsymbol{\beta}_{i} | \boldsymbol{\alpha}\right)\) must be proper (integrating to one) and cannot be arbitrarily vague in the sense of almost all of its mass being outside any believable compact set.
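A toy calculation makes the problem with vague proper priors concrete. The setting below is an assumption chosen for simplicity and is not taken from the paper: \(y_1,\dots,y_n \sim N(\theta, 1)\), testing \(\theta = 0\) against \(\theta \sim N(0, \tau^2)\). In that case the Bayes factor has the closed form \(B_{10} = (1+n\tau^2)^{-1/2} \exp\{\tfrac{1}{2} n \bar y^2 \, n\tau^2/(1+n\tau^2)\}\), and the function name `bf10_normal_mean` is hypothetical.

```python
import math


def bf10_normal_mean(ybar, n, tau):
    """B_10 for H0: theta = 0 vs H1: theta ~ N(0, tau^2), with y_i ~ N(theta, 1)."""
    shrink = n * tau**2 / (1.0 + n * tau**2)
    # Work on the log scale to avoid overflow for large n * ybar^2.
    log_bf = -0.5 * math.log1p(n * tau**2) + 0.5 * n * ybar**2 * shrink
    return math.exp(log_bf)


# Same data (z = sqrt(n) * ybar = 4), increasingly vague prior scale tau.
for tau in (1.0, 10.0, 1e3, 1e6):
    print(tau, bf10_normal_mean(ybar=0.8, n=25, tau=tau))
```

For fixed data the Bayes factor shrinks roughly like \(1/\tau\) once \(\tau\) is large, so the "answer" is driven by the arbitrary choice of vagueness rather than by the data.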

18.3.2 Consistency criteria

Following Liang et al. (2008), two main consistency criteria are considered: model selection consistency and information consistency.

Criterion 2(Model selection consistency) If data y have been generated by \(M_i\), then the posterior probability of \(M_i\) should converge to 1 as the sample size \(n\rightarrow\infty\).

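Model selection consistency is not particularly controversial, although the true model is sometimes not among the candidate models. A number of papers address this criterion: Fernández, Ley and Steel (2001), Berger, Ghosh and Mukhopadhyay (2003), Liang et al. (2008), Casella et al. (2009), Guo and Speckman (2009).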

Criterion 3 (Information consistency): For any model \(M_i\), if \(\{\mathbf{y}_m, m=1,\ldots\}\) is a sequence of data vectors of fixed size such that, as \(m\rightarrow\infty\), \[ \Lambda_{i 0}\left(\mathbf{y}_{m}\right)=\frac{\sup _{\boldsymbol{\alpha}, \boldsymbol{\beta}_{i}} f_{i}\left(\mathbf{y}_{m} | \boldsymbol{\alpha}, \boldsymbol{\beta}_{i}\right)}{\sup _{\boldsymbol{\alpha}} f_{0}\left(\mathbf{y}_{m} | \boldsymbol{\alpha}\right)} \rightarrow \infty \quad \text { then } B_{i 0}\left(\mathbf{y}_{m}\right) \rightarrow \infty \] In linear models with normal errors, this criterion is equivalent to requiring that, for any sequence of data vectors whose F (or t) statistic goes to infinity, the Bayes factor goes to infinity as well.

Jeffreys (1961) used this argument to justify a Cauchy prior for testing that a normal mean is zero. The criterion has also been highlighted in Berger and Pericchi (2001), Bayarri and García-Donato (2008), Liang et al. (2008).

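One can construct examples in which a Bayesian answer violates information consistency, but they involve very small sample sizes and priors with extremely flat tails. Moreover, violating information consistency puts frequentist and Bayesian answers in sharp conflict, which can be seen as unattractive.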
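A standard illustration of information inconsistency is the fixed-\(g\) Zellner prior discussed in Liang et al. (2008). Assuming an intercept-only null model, its Bayes factor is \(B_{i0} = (1+g)^{(n-1-k_i)/2}\,[1+g\,Q_{i0}]^{-(n-1)/2}\) with \(Q_{i0} = \mathrm{SSE}_i/\mathrm{SSE}_0 = 1-R^2\). The sketch below (the function name `fixed_g_bf` and the numbers are illustrative choices) shows that the Bayes factor stays bounded even as the fit becomes perfect.

```python
import math


def fixed_g_bf(q, n, k, g):
    """Bayes factor of M_i vs an intercept-only null under a fixed-g Zellner prior.

    q = SSE_i / SSE_0 = 1 - R^2, k = number of covariates in M_i beyond the intercept.
    """
    log_bf = 0.5 * (n - 1 - k) * math.log1p(g) - 0.5 * (n - 1) * math.log1p(g * q)
    return math.exp(log_bf)


n, k, g = 10, 2, 10.0
limit = (1.0 + g) ** ((n - 1 - k) / 2)  # value approached as q -> 0 (F statistic -> infinity)
for q in (0.5, 1e-2, 1e-4, 1e-8):
    print(q, fixed_g_bf(q, n, k, g), limit)
```

However overwhelming the evidence in the data, the Bayes factor cannot exceed `limit`, which is exactly the behaviour Criterion 3 rules out.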

The third type of consistency addresses the fact that objective model selection priors typically depend on specific features of the model, such as the sample size or the particular covariates being considered.

Criterion 4(Intrinsic prior consistency) Let \(\pi_{i}\left(\boldsymbol{\beta}_{i} | \boldsymbol{\alpha}, n\right)\) denote the prior for the model specific parameters of model \(M_i\) with sample size n. Then as \(n\rightarrow \infty\) and under suitable conditions on the evolution of the model with n, \(\pi_{i}\left(\boldsymbol{\beta}_{i} | \boldsymbol{\alpha}, n\right)\) should converge to a proper prior \(\pi_{i}\left(\boldsymbol{\beta}_{i} | \boldsymbol{\alpha}\right)\)

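The idea is that features of the model and the sample size frequently affect model selection priors, but such features should disappear when \(n\) is large. If such a limiting prior exists, it is called an intrinsic prior; see Berger and Pericchi (2001).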

18.3.3 Predictive matching criteria

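If the scale of a prior is off by a factor of 2, it makes little difference in one dimension, but in 50 dimensions it leads to an error of \(2^{50}\) in the Bayes factor.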
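The arithmetic behind this remark can be checked directly. As a simplifying assumption, take an isotropic normal prior \(N(\mathbf{0}, \tau^2 \mathbf{I}_k)\): its density at the origin is \((2\pi\tau^2)^{-k/2}\), so multiplying \(\tau\) by a factor \(c\) changes the normalising constant by \(c^{k}\). The helper name `log_scale_ratio` is hypothetical.

```python
import math


def log_scale_ratio(k, factor=2.0):
    """log of the ratio of normalising constants of N(0, tau^2 I_k) and N(0, (factor*tau)^2 I_k)."""
    return k * math.log(factor)


for k in (1, 50):
    print(k, math.exp(log_scale_ratio(k)))
```

A factor that is harmless for \(k=1\) becomes roughly \(10^{15}\) for \(k=50\), which is why the scale of the prior needs to be calibrated carefully in high dimensions.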

Jeffreys defined a “minimal sample size” as one for which one would logically be unable to discriminate between two hypotheses, and argued that the prior distributions should be chosen to yield equal marginal likelihoods for the two hypotheses.

In other words, if two hypotheses cannot be distinguished, the chosen priors should give equal marginal likelihoods under both hypotheses.

Example:

\[ M_{1} : y \sim \frac{1}{\sigma} p_{1}\left(\frac{y-\mu}{\sigma}\right) \quad \text { and } \quad M_{2} : y \sim \frac{1}{\sigma} p_{2}\left(\frac{y-\mu}{\sigma}\right) \]
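As a worked check of this example (a sketch, assuming the scale-invariant prior \(\pi(\mu,\sigma)=1/\sigma\) under both models, which these notes do not state explicitly), take two observations \(y_1 \neq y_2\). For \(l=1,2\), \[ m_{l}(y_1,y_2)=\int_{0}^{\infty}\!\!\int_{-\infty}^{\infty} \frac{1}{\sigma}\cdot\frac{1}{\sigma^{2}}\, p_{l}\!\left(\frac{y_1-\mu}{\sigma}\right) p_{l}\!\left(\frac{y_2-\mu}{\sigma}\right) d\mu\, d\sigma . \] Changing variables to \(u=(y_1-\mu)/\sigma\), \(v=(y_2-\mu)/\sigma\) gives \(\left|\partial(u,v)/\partial(\mu,\sigma)\right| = |v-u|/\sigma^{2}\) and \(\sigma|v-u| = |y_2-y_1|\), so \[ m_{l}(y_1,y_2)=\frac{1}{|y_1-y_2|}\iint_{(v-u)(y_2-y_1)>0} p_{l}(u)\,p_{l}(v)\,du\,dv=\frac{1}{2|y_1-y_2|}, \] because the region of integration is the event that two independent draws from \(p_l\) fall in a given order, which has probability \(1/2\). The result does not depend on \(p_l\), so \(m_1 = m_2\) for every sample of size two. With a single observation the marginal is \(\int_0^\infty \sigma^{-1}\,d\sigma = \infty\), so under this prior two is the smallest sample size with a finite marginal.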

Definition 1. The model/prior pairs \(\{M_i,\pi_i\}\) and \(\{M_j,\pi_j\}\) are predictive matching at sample size \(n^*\) if the predictive distributions \(m_{i}\left(\mathbf{y}^{*}\right)\) and \(m_{j}\left(\mathbf{y}^{*}\right)\) are close in terms of some distance measure for data of that sample size. The model/prior pairs \(\{M_i,\pi_i\}\) and \(\{M_j,\pi_j\}\) are exact predictive matching at sample size \(n^*\) if \(m_{i}\left(\mathbf{y}^{*}\right)=m_{j}\left(\mathbf{y}^{*}\right)\) for all \(\mathbf{y}^*\) of sample size \(n^*\).

In words: a pair of model/prior combinations is predictive matching at sample size \(n^*\) if, for data of that size, the predictive distributions \(m_i(\boldsymbol y^*)\) and \(m_j(\boldsymbol y^*)\) are close under some distance measure. It is exact predictive matching at sample size \(n^*\) if \(m_{i}\left(\mathbf{y}^{*}\right)=m_{j}\left(\mathbf{y}^{*}\right)\). The first notion only requires closeness; the exact version requires equality.

Definition 2 (Minimal training sample). A minimal training sample \(\boldsymbol y_i^*\) for \(\{M_i,\pi_i\}\) is a sample of minimal size \(n^*_i\geq 1\) with a finite nonzero marginal density \(m_i(\boldsymbol y_i^*)\).

Definition 3 (Null predictive matching). The model selection priors are null predictive matching if each of the model/prior pairs \(\{M_i,\pi_i\}\) and \(\{M_0,\pi_0\}\) are exact predictive matching for all minimal training samples \(\boldsymbol y^*_i\) for \(\{M_i,\pi_i\}\)

That is, the matching is required against the null model rather than against an arbitrary \(M_j\).

Definition 4 (Dimensional predictive matching). The model selection priors are dimensional predictive matching if each of the model/prior pairs \(\{M_i,\pi_i\}\) and \(\{M_j,\pi_j\}\) of the same complexity/dimension (i.e., \(k_i=k_j\)) are exact predictive matching for all minimal training samples \(\boldsymbol y_i^*\) for models of that dimension.

Dimensional predictive matching thus requires any two models of the same dimension to match.

18.3.4 Invariance criteria

Model selection should be invariant to the units of measurement being used:

Criterion 6(Measurement invariance). The units of measurement used for the observations or model parameters should not affect Bayesian answers.

More generally, model structures are often invariant under groups of transformations.

Definition 5. The family of densities for \(\mathbf{y} \in \mathbb{R}^{n}\), \(\mathfrak{F} :=\{f(\mathbf{y} | \boldsymbol{\theta}) : \boldsymbol{\theta} \in \Theta\}\), is said to be invariant under the group of transformations \(\mathfrak{G} :=\left\{g : \mathbb{R}^{n} \rightarrow \mathbb{R}^{n}\right\}\) if, for every \(g \in \mathfrak{G}\) and \(\boldsymbol{\theta} \in \Theta\), there exists a unique \(\boldsymbol{\theta}^*\in \Theta\) such that \(\mathbf{X}=g(\mathbf{Y})\) has density \(f\left(\mathbf{x} | \boldsymbol{\theta}^{*}\right) \in \mathfrak{F}\). In such a situation, \(\boldsymbol{\theta}^*\) will be denoted \(\overline g(\boldsymbol{\theta})\).

Criterion 7 (Group invariance). If all models are invariant under a group of transformations \(G_0\), then the conditional distributions, \(\pi_i(\beta_i|\alpha)\), should be chosen in such a way that the conditional marginal distributions \[ f_{i}(\mathbf{y} | \boldsymbol{\alpha})=\int f_{i}\left(\mathbf{y} | \boldsymbol{\alpha}, \boldsymbol{\beta}_{i}\right) \pi_{i}\left(\boldsymbol{\beta}_{i} | \boldsymbol{\alpha}\right) d \boldsymbol{\beta}_{i} \] are also invariant under \(G_0\).

18.4 Objective prior distributions for variable selection in normal linear models.

18.4.1 Introduction

Variable selection in normal linear models.

\[ \begin{aligned} M_{0} : f_{0}\left(\mathbf{y} | \boldsymbol{\beta}_{0}, \sigma\right) &=\mathcal{N}_{n}\left(\mathbf{y} | \mathbf{X}_{0} \boldsymbol{\beta}_{0}, \sigma^{2} \mathbf{I}\right) \\ M_{i} : f_{i}\left(\mathbf{y} | \boldsymbol{\beta}_{i}, \boldsymbol{\beta}_{0}, \sigma\right) &=\mathcal{N}_{n}\left(\mathbf{y} | \mathbf{X}_{0} \boldsymbol{\beta}_{0}+\mathbf{X}_{i} \boldsymbol{\beta}_{i}, \sigma^{2} \mathbf{I}\right), \quad i=1, \ldots, 2^{p}-1 \end{aligned} \]

Zellner and Siow (1980) used the common objective estimation priors for \(\alpha\) and multivariate Cauchy priors for \(\pi_i(\beta_i|\alpha)\), centered at zero and with prior scale matrix \(\sigma^{2} n\left(\mathbf{X}_{i}^{\prime} \mathbf{X}_{i}\right)^{-1}\); Zellner (1986) used a similar scale matrix in what is called the “g-prior”.

18.4.2 Proposed prior (the “robust prior”)

Under model \(M_i\), the prior is of the form \[ \begin{aligned} \pi_{i}^{R}\left(\boldsymbol{\beta}_{0}, \boldsymbol{\beta}_{i}, \sigma\right) &=\pi\left(\boldsymbol{\beta}_{0}, \sigma\right) \times \pi_{i}^{R}\left(\boldsymbol{\beta}_{i} | \boldsymbol{\beta}_{0}, \sigma\right) \\ &=\sigma^{-1} \times \int_{0}^{\infty} \mathcal{N}_{k_{i}}\left(\boldsymbol{\beta}_{i} | \mathbf{0}, g \boldsymbol{\Sigma}_{i}\right) p_{i}^{R}(g) d g \end{aligned} \] where \(\mathbf{\Sigma}_{i}=\operatorname{Cov}\left(\hat{\boldsymbol{\beta}}_{i}\right)=\sigma^{2}\left(\mathbf{V}_{i}^{t} \mathbf{V}_{i}\right)^{-1}\) is the covariance of the maximum likelihood estimator of \(\beta_i\), with \[ \mathbf{V}_{i}=\left(\mathbf{I}_{n}-\mathbf{X}_{0}\left(\mathbf{X}_{0}^{t} \mathbf{X}_{0}\right)^{-1} \mathbf{X}_{0}^{t}\right) \mathbf{X}_{i} \] and \[ p_{i}^{R}(g)=a\left[\rho_{i}(b+n)\right]^{a}(g+b)^{-(a+1)} 1_{\left\{g>\rho_{i}(b+n)-b\right\}} \] \[ \text { with } a>0, b>0 \quad \text { and } \quad \rho_{i} \geq \frac{b}{b+n} \]

Under these conditions, \(p_i^R(g)\) is a proper density supported on positive \(g\), so that \(\pi_i^R(\beta_i|\beta_0,\sigma)\) is proper, satisfying the first part of the Basic criterion.

Check that \(p_i^R(g)\) is a proper prior. The density is

\[ a[\rho_i(b+n)]^a(g+b)^{-(a+1)}1_{\{g>\rho_i(b+n)-b\}} \] with \(a>0,b>0\) and \(\rho_i\geq\frac{b}{b+n}\), so \(\rho_i(b+n)-b\geq 0\). When integrating the prior over \(g\), the lower limit therefore moves from 0 to \(\rho_i(b+n)-b\geq 0\).

\[ \begin{aligned} &\int _{\rho_i(b+n)-b}^\infty a[\rho_i(b+n)]^a (g+b)^{-(a+1)}\,dg\\ =\;& a[\rho_i(b+n)]^a \left.\left (-\frac{1}{a}(g+b)^{-a}\right)\right|^{\infty}_{\rho_i(b+n)-b}\\ =\;&a[\rho_i(b+n)]^a\left(\frac{1}{a}(\rho_i(b+n))^{-a}\right)\\ =\;&1 \end{aligned} \] So \(p_i^R(g)\) is a proper prior. That the conditional prior on \(\boldsymbol\beta_i\) integrates to one then follows by exchanging the order of integration (the integrand is nonnegative) and using the fact that the normal density integrates to one for every \(g>0\): \[ \begin{aligned} &\int_{\mathbb{R}^{k_i}} \left( \int _{0}^{\infty} \mathcal{N}_{k_{i}}\left(\boldsymbol{\beta}_{i} | \mathbf{0}, g \mathbf{\Sigma}_{i}\right) p_i^R(g)\,dg \right)d\boldsymbol\beta_i\\ =\;& \int _{0}^{\infty} p_i^R(g) \left( \int_{\mathbb{R}^{k_i}} \mathcal{N}_{k_{i}}\left(\boldsymbol{\beta}_{i} | \mathbf{0}, g \mathbf{\Sigma}_{i}\right) d\boldsymbol\beta_i \right) dg\\ =\;& \int _{0}^{\infty} p_i^R(g)\,dg = 1 \end{aligned} \] The factor \(\sigma^{-1}\) belongs to the prior on the common parameters \((\boldsymbol\beta_0,\sigma)\) and is not part of the conditional prior.
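The same calculation yields a closed-form CDF, \(1-[\rho_i(b+n)/(g+b)]^{a}\) on the support, which can be inverted to simulate \(g\). The sketch below is illustrative only: the function names are hypothetical, and the values \(n=30\), \(a=1/2\), \(b=1\), \(\rho_i=1/3\) are chosen for the example.

```python
import random


def robust_g_lower(n, b, rho):
    """Lower end of the support of p_i^R(g): rho * (b + n) - b."""
    return rho * (b + n) - b


def robust_g_pdf(g, n, a, b, rho):
    """Density a [rho (b+n)]^a (g+b)^-(a+1) on g > rho (b+n) - b, zero elsewhere."""
    if g <= robust_g_lower(n, b, rho):
        return 0.0
    return a * (rho * (b + n)) ** a * (g + b) ** (-(a + 1))


def robust_g_cdf(g, n, a, b, rho):
    """Closed-form CDF: 1 - [rho (b+n) / (g+b)]^a on the support."""
    if g <= robust_g_lower(n, b, rho):
        return 0.0
    return 1.0 - (rho * (b + n) / (g + b)) ** a


def robust_g_sample(n, a, b, rho, rng=random):
    """Inverse-CDF draw: solve 1 - cdf(g) = u for u uniform on (0, 1]."""
    u = 1.0 - rng.random()  # lies in (0, 1], so u ** (-1/a) is finite
    return rho * (b + n) * u ** (-1.0 / a) - b


n, a, b, rho = 30, 0.5, 1.0, 1.0 / 3.0
print(robust_g_lower(n, b, rho), robust_g_cdf(1e12, n, a, b, rho))
```

With \(a=1/2\) the tail of \(g\) is very heavy, so simulated values occasionally become extremely large; that is expected behaviour rather than a numerical problem.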

Result 1. The conditional marginals \[ f_{i}\left(\mathbf{y} | \boldsymbol{\beta}_{0}, \sigma\right)=\int \mathcal{N}_{n}\left(\mathbf{y} | \mathbf{X}_{0} \boldsymbol{\beta}_{0}+\mathbf{X}_{i} \boldsymbol{\beta}_{i}, \sigma^{2} \mathbf{I}\right) \pi_{i}\left(\boldsymbol{\beta}_{i} | \boldsymbol{\beta}_{0}, \sigma\right) d \boldsymbol{\beta}_{i} \] are invariant under \(G_0\) if and only if \(\pi_i(\beta_i|\beta_0,\sigma)\) has the form \(\pi_{i}\left(\boldsymbol{\beta}_{i} | \boldsymbol{\beta}_{0}, \sigma\right)=\sigma^{-k_{i}} h_{i}\left(\frac{\boldsymbol{\beta}_{i}}{\sigma}\right)\), where \(h_i\) is any proper density with support \(\mathbb{R}^{k_i}\). For example, the robust prior is a particular case with \[ h_{i}^{R}(\mathbf{u})=\int \mathcal{N}_{k_{i}}\left(\mathbf{u} | \mathbf{0}, g\left(\mathbf{V}_{i}^{t} \mathbf{V}_{i}\right)^{-1}\right) p_{i}^{R}(g) d g. \] Under the Group invariance criterion, this result implies that, conditional on the common parameters \(\beta_0\) and \(\sigma\), the prior on \(\beta_i\) must be scaled by \(\sigma\), centered at zero and not depend on \(\beta_0\). The robust prior satisfies the Group invariance criterion.

The prior for the common parameters \((\beta_0,\sigma)\) is taken to be the right-Haar density, namely \[ \pi_{i}\left(\boldsymbol{\beta}_{0}, \sigma\right)=\pi^{H}\left(\boldsymbol{\beta}_{0}, \sigma\right)=\sigma^{-1} \]

The overall model prior would be of the form \[ \pi_{i}\left(\boldsymbol{\beta}_{0}, \boldsymbol{\beta}_{i}, \sigma\right)=\sigma^{-1-k_{i}} h_{i}\left(\frac{\boldsymbol{\beta}_{i}}{\sigma}\right) \]

Result 2. For \(M_i\), let the prior \(\pi_i(\beta_0,\beta_i,\sigma)\) be of the form \(\sigma^{-1-k_{i}} h_{i}\left(\frac{\boldsymbol{\beta}_{i}}{\sigma}\right)\) given above, where \(h_i\) is symmetric about zero. Then all model/prior pairs \(\{M_i,\pi_i\}\) are exact predictive matching for \(n^*=k_0+1\).

The Group invariance criterion and the Predictive matching criterion imply that model selection priors should be of the form \(\pi_{i}\left(\boldsymbol{\beta}_{0}, \boldsymbol{\beta}_{i}, \sigma\right)=\sigma^{-1-k_{i}} h_{i}\left(\frac{\boldsymbol{\beta}_{i}}{\sigma}\right)\), with \(h_i\) symmetric about zero. The robust prior satisfies these criteria, as it is clearly symmetric about zero. Any scale mixture of normals also satisfies them, because its \(h(\cdot)\) is symmetric about zero.

18.4.2.1 Justification of the prior for the model specific parameters

Result 3. For \(M_i\), let the prior be of the form \(\pi_{i}\left(\boldsymbol{\beta}_{0}, \boldsymbol{\beta}_{i}, \sigma\right)=\sigma^{-1-k_{i}} h_{i}\left(\frac{\boldsymbol{\beta}_{i}}{\sigma}\right)\), where \(h_i\) is the scale mixture of normals \(h_{i}^{R}(\mathbf{u})=\int \mathcal{N}_{k_{i}}\left(\mathbf{u} | \mathbf{0}, g\left(\mathbf{V}_{i}^{t} \mathbf{V}_{i}\right)^{-1}\right) p_{i}^{R}(g) d g\). The priors are then null predictive matching and dimensional predictive matching for samples of size \(k_0+k_i\), and no choice of the conditional scale matrix other than \(\left(\mathbf{V}_{i}^{t} \mathbf{V}_{i}\right)^{-1}\) (or a multiple) can achieve this predictive matching.

This is a surprising result: predictive matching holds for larger sample sizes \((k_0+k_i)\) than are encountered in typical predictive matching results. It holds only when the scale matrices are proportional to \(\left(\mathbf{V}_{i}^{t} \mathbf{V}_{i}\right)^{-1}\).

18.4.3 Choosing the hyperparameters for \(p_i^R(g)\)

18.4.3.1 Introduction.

The Bayes factor of \(M_i\) to \(M_0\) can be expressed as \[ B_{i 0}=Q_{i 0}^{-\left(n-k_{0}\right) / 2} \frac{2 a}{k_{i}+2 a}\left[\rho_{i}(n+b)\right]^{-k_{i} / 2} \mathrm{AP}_{i} \] where \(\mathrm{AP}_i\) is the hypergeometric function of two variables, or Appell hypergeometric function, \[ \mathrm{AP}_{i}=\mathrm{F}_{1}\left[a+\frac{k_{i}}{2} ; \frac{k_{i}+k_{0}-n}{2}, \frac{n-k_{0}}{2} ; a+1+\frac{k_{i}}{2} ; \frac{(b-1)}{\rho_{i}(b+n)} ; \frac{b-Q_{i 0}^{-1}}{\rho_{i}(b+n)}\right] \] and \(Q_{i 0}=\mathrm{SSE}_{i} / \mathrm{SSE}_{0}\) is the ratio of the sums of squared errors of models \(M_i\) and \(M_0\).

A closed-form Bayes factor is not essential, but it is very useful when dealing with \(2^p\) models.

The recommended hyperparameters are \(a=1/2\), \(b=1\) and \(\rho_{i}=\left(k_{i}+k_{0}\right)^{-1}\).

18.4.3.2 Implications of the consistency criteria.

The consistency criteria provide considerable guidance on how to choose \(a\), \(b\) and \(\rho_i\). In particular, the following result can be derived.

Result 4. The three consistency criteria are satisfied by the robust prior if \(a\) and \(\rho_i\) do not depend on \(n\), \(\lim _{n \rightarrow \infty} \frac{b}{n}=c \geq 0\), \(\lim _{n \rightarrow \infty} \rho_{i}(b+n)=\infty\) and \(n \geq k_{i}+k_{0}+2 a\).

Use of model selection consistency

Suppose \(M_i\) is the true model and consider any other model \(M_j\). A key assumption for model selection consistency is that, asymptotically, the design matrices are such that the models are differentiated, in the sense that \[ \lim _{n \rightarrow \infty} \frac{\boldsymbol{\beta}_{i}^{t} \mathbf{V}_{i}^{t}\left(\mathbf{I}-\mathbf{P}_{j}\right) \mathbf{V}_{i} \boldsymbol{\beta}_{i}}{n}=b_{j} \in(0, \infty) \] where \(\mathbf{P}_{j}=\mathbf{V}_{j}\left(\mathbf{V}_{j}^{t} \mathbf{V}_{j}\right)^{-1} \mathbf{V}_{j}^{t}\).

Result 5. Suppose the condition above is satisfied and that the priors \(\pi_i(\beta_0,\beta_i,\sigma)\) are of the form \(\sigma^{-1-k_{i}} h_{i}\left(\frac{\boldsymbol{\beta}_{i}}{\sigma}\right)\), with \(h_{i}(\mathbf{u})=\int \mathcal{N}_{k_{i}}\left(\mathbf{u} | \mathbf{0}, g\left(\mathbf{V}_{i}^{t} \mathbf{V}_{i}\right)^{-1}\right) p_{i}(g) d g\). If the \(p_i(g)\) are proper densities such that \[ \lim _{n \rightarrow \infty} \int_{0}^{\infty}(1+g)^{-k_{i} / 2} p_{i}(g) d g=0 \] model selection consistency will result.

Corollary 1. The robust prior distributions \[ \begin{aligned} \pi_{i}^{R}\left(\boldsymbol{\beta}_{0}, \boldsymbol{\beta}_{i}, \sigma\right) &=\pi\left(\boldsymbol{\beta}_{0}, \sigma\right) \times \pi_{i}^{R}\left(\boldsymbol{\beta}_{i} | \boldsymbol{\beta}_{0}, \sigma\right) \\ &=\sigma^{-1} \times \int_{0}^{\infty} \mathcal{N}_{k_{i}}\left(\boldsymbol{\beta}_{i} | \mathbf{0}, g \boldsymbol{\Sigma}_{i}\right) p_{i}^{R}(g) d g \end{aligned} \] are model selection consistent if \[ \lim _{n \rightarrow \infty} \rho_{i}(b+n)=\infty \]

18.4.3.3 Choice of hyper-parameters

b=1 has a notable computational advantage.

I could not get any further here; to be revisited later.

18.5 Summary

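Revisiting the abstract: in objective Bayesian model selection, no single criterion has emerged as dominant in defining objective priors. Indeed, many criteria have been proposed separately and used to find suitable priors. This paper formalizes most of the commonly used criteria and combines them into a new set of criteria. The result is a new objective model selection prior with many desirable properties.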

Moreover, improper priors ordinarily cannot be used directly, and one cannot simply fall back on “vague proper priors” either (because of the arbitrariness of the normalising constant).

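The model selection approach in the paper is based on Bayes factors and uses posterior model probabilities to assess models: \[ \operatorname{Pr}\left(M_{i} | \mathbf{y}\right)=\frac{P_{i 0} B_{i 0}}{1+\left(\sum_{j=1}^{N-1} B_{j 0} P_{j 0}\right)} \] where \(P_{i0}=1\) under equal prior model probabilities.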

The paper then lays out the following criteria.

    1. Basic criterion: the conditional prior \(\pi_{i}\left(\boldsymbol{\beta}_{i} | \boldsymbol{\alpha}\right)\) must be proper.
    1. Model selection consistency: if the data come from model \(M_i\), the posterior probability of \(M_i\) should converge to 1 as the sample size \(n\rightarrow \infty\).
    1. Information consistency: for a sequence of data vectors \(\{\boldsymbol y_m,m=1,...\}\) of fixed dimension, if, as \(m\rightarrow \infty\), \[ \Lambda_{i 0}\left(\mathbf{y}_{m}\right)=\frac{\sup _{\boldsymbol{\alpha}, \boldsymbol{\beta}_{i}} f_{i}\left(\mathbf{y}_{m} | \boldsymbol{\alpha}, \boldsymbol{\beta}_{i}\right)}{\sup _{\boldsymbol{\alpha}} f_{0}\left(\mathbf{y}_{m} | \boldsymbol{\alpha}\right)} \rightarrow \infty \quad \text { then } B_{i 0}\left(\mathbf{y}_{m}\right) \rightarrow \infty \] Equivalently, if the F or t statistic of a sequence of data sets goes to infinity, so should the Bayes factor. The purpose of this criterion is to avoid conflict between Bayesian and frequentist answers; in general we are not interested in priors that conflict with frequentist methods.
    1. Intrinsic prior consistency: if a prior depends on the sample size \(n\), then as \(n\rightarrow\infty\) it should converge to a proper prior.
    1. Predictive matching criteria: these are involved. Roughly, as I understand it, if the amount of data is small (\(<n^*\)) and cannot separate two models, the two models are said to be predictive matching at sample size \(n^*\).

Model selection priors should be predictive matching for an appropriately defined “minimal sample size” in comparing \(M_i\) with \(M_j\).

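    1. Measurement invariance: the units of the observations and of the model parameters should not affect the conclusions of Bayesian model selection.
    1. Group invariance: if all models are invariant under a group of transformations \(G_0\), then the conditional distributions \(\pi_i(\beta_i|\alpha)\) should be chosen so that the conditional marginal distributions \[ f_{i}(\mathbf{y} | \boldsymbol{\alpha})=\int f_{i}\left(\mathbf{y} | \boldsymbol{\alpha}, \boldsymbol{\beta}_{i}\right) \pi_{i}\left(\boldsymbol{\beta}_{i} | \boldsymbol{\alpha}\right) d \boldsymbol{\beta}_{i} \] are also invariant under \(G_0\).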

The recommended prior, called the “robust prior”, has the form \[ \begin{aligned} \pi_{i}^{R}\left(\boldsymbol{\beta}_{0}, \boldsymbol{\beta}_{i}, \sigma\right) &=\pi\left(\boldsymbol{\beta}_{0}, \sigma\right) \times \pi_{i}^{R}\left(\boldsymbol{\beta}_{i} | \boldsymbol{\beta}_{0}, \sigma\right) \\ &=\sigma^{-1} \times \int_{0}^{\infty} \mathcal{N}_{k_{i}}\left(\boldsymbol{\beta}_{i} | \mathbf{0}, g \boldsymbol{\Sigma}_{i}\right) p_{i}^{R}(g) d g \end{aligned} \] It is also a g-prior construction: a normal prior with an extra factor \(g\) in the covariance, with \(g\) then integrated out. \(\boldsymbol{\Sigma}_{i}=\operatorname{Cov}\left(\hat{\boldsymbol{\beta}}_{i}\right)=\sigma^{2}\left(\mathbf{V}_{i}^{t} \mathbf{V}_{i}\right)^{-1}\) is the covariance of the MLE of \(\beta_i\), with \[ \mathbf{V}_{i}=\left(\mathbf{I}_{n}-\mathbf{X}_{0}\left(\mathbf{X}_{0}^{t} \mathbf{X}_{0}\right)^{-1} \mathbf{X}_{0}^{t}\right) \mathbf{X}_{i} \] and \[ p_{i}^{R}(g)=a\left[\rho_{i}(b+n)\right]^{a}(g+b)^{-(a+1)} 1_{\left\{g>\rho_{i}(b+n)-b\right\}} \] with \[ a>0, b>0 \quad \text { and } \quad \rho_{i} \geq \frac{b}{b+n}. \]

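This prior has nice properties; for example, its tails behave like those of a multivariate Student distribution. The thickness of the prior tails is in turn closely tied to the information consistency criterion.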

A more general form is also proposed: \[ \pi_{i}\left(\boldsymbol{\beta}_{i} | \boldsymbol{\beta}_{0}, \sigma\right)=\sigma^{-k_{i}} h_{i}\left(\frac{\boldsymbol{\beta}_{i}}{\sigma}\right) \] The robust prior is a special case of this form, with \[ h_{i}^{R}(\mathbf{u})=\int \mathcal{N}_{k_{i}}\left(\mathbf{u} | \mathbf{0}, g\left(\mathbf{V}_{i}^{t} \mathbf{V}_{i}\right)^{-1}\right) p_{i}^{R}(g) d g \] For the Gaussian model, the conditional marginal distributions are invariant under \(G_0\) if and only if the prior has this general form.

If the prior has the form \[ \pi_{i}\left(\boldsymbol{\beta}_{0}, \boldsymbol{\beta}_{i}, \sigma\right)=\sigma^{-1-k_{i}} h_{i}\left(\frac{\boldsymbol{\beta}_{i}}{\sigma}\right) \] with \(h_i\) symmetric about zero, then all model/prior pairs \(\{M_i,\pi_i\}\) are exact predictive matching for \(n^*=k_0+1\).

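Given such a nice form, another question arises naturally: for the robust prior, how should the hyperparameters be chosen?

Here the consistency criteria can guide the choice of hyperparameters.

They yield a set of conditions that \(a\), \(b\) and \(\rho_i\) should satisfy.

Conditions alone are not enough, however; what specific values should be used?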

Under certain conditions, \[ a=1/2. \]

\(b\) should be chosen so that \(b/n \rightarrow c=0\), which gives flatter tails. Overall, \[ b=1. \]

The choice of \(\rho_i\) is the most involved:

\[ \rho_{i} \text { must be a constant (independent of } n \text {) and } \rho_{i} \geq 1 /\left(1+k_{0}+k_{i}\right) \]

…. The choice \(\rho_{i}=1 /\left(k_{0}+k_{i}+1\right)\) is the minimum value of \(\rho_i\) and is, hence, certainly a candidate.

The value recommended in the end is \(\rho_i=1/(k_0+k_i)\).

In summary, the variable selection procedure is as follows.

Prior: \[ \pi_{i}^{R}\left(\boldsymbol{\beta}_{0}, \boldsymbol{\beta}_{i}, \sigma\right)=\sigma^{-1} \times \int_{0}^{\infty} \mathcal{N}_{k_{i}}\left(\boldsymbol{\beta}_{i} | \mathbf{0}, g \mathbf{\Sigma}_{i}\right) p_{i}^{R}(g) d g \]

\[ \begin{array}{l}{\text { where } \boldsymbol{\Sigma}_{i}=\sigma^{2}\left(\mathbf{V}_{i}^{t} \mathbf{V}_{i}\right)^{-1}, \mathbf{V}_{i}=\left(\mathbf{I}_{n}-\mathbf{X}_{0}\left(\mathbf{X}_{0}^{t} \mathbf{X}_{0}\right)^{-1} \mathbf{X}_{0}^{t}\right) \mathbf{X}_{i}, \text { and }} \\ {\qquad p_{i}^{R}(g)=\frac{1}{2}\left[\frac{(1+n)}{\left(k_{i}+k_{0}\right)}\right]^{1 / 2}(g+1)^{-3 / 2} 1_{\left\{g>\left(k_{i}+k_{0}\right)^{-1}(1+n)-1\right\}}}\end{array} \] The Bayes factor then has the closed form \[ \begin{aligned} B_{i 0}=&\left[\frac{n+1}{k_{i}+k_{0}}\right]^{-k_{i} / 2} \\ & \times \frac{Q_{i 0}^{-\left(n-k_{0}\right) / 2}}{k_{i}+1} \, {}_{2} F_{1}\left[\frac{k_{i}+1}{2} ; \frac{n-k_{0}}{2} ; \frac{k_{i}+3}{2} ; \frac{\left(1-Q_{i 0}^{-1}\right)\left(k_{i}+k_{0}\right)}{(1+n)}\right] \end{aligned} \] where \({}_{2} F_{1}\) is the standard hypergeometric function and \(Q_{i 0}=\mathrm{SSE}_{i} / \mathrm{SSE}_{0}\).
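For experimentation, the closed form can be evaluated without a special-function library. The sketch below is not the paper's implementation: it uses the Euler integral representation of \({}_2F_1\) and the substitution \(u=v^2\), which (writing \(L=(n+1)/(k_i+k_0)\)) turns the closed form into \(B_{i0}=L^{-k_i/2}\int_0^1 v^{k_i}\,[Q_{i0}+(1-Q_{i0})v^{2}/L]^{-(n-k_0)/2}\,dv\), and then applies composite Simpson quadrature. The function name `robust_bayes_factor` and the default number of quadrature steps are illustrative assumptions.

```python
def robust_bayes_factor(q, n, k_i, k_0, steps=2000):
    """B_i0 under the robust prior with a = 1/2, b = 1, rho_i = 1/(k_i + k_0).

    q = SSE_i / SSE_0. Evaluates
        L^(-k_i/2) * int_0^1 v^k_i * [q + (1 - q) v^2 / L]^(-(n - k_0)/2) dv,
    with L = (n + 1)/(k_i + k_0), by composite Simpson quadrature.
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must lie in (0, 1]")
    if steps % 2:
        raise ValueError("steps must be even for Simpson's rule")
    big_l = (n + 1.0) / (k_i + k_0)
    power = -(n - k_0) / 2.0

    def integrand(v):
        return v ** k_i * (q + (1.0 - q) * v * v / big_l) ** power

    h = 1.0 / steps
    total = integrand(0.0) + integrand(1.0)
    for j in range(1, steps):
        # Simpson weights: 4 at odd nodes, 2 at even interior nodes.
        total += (4.0 if j % 2 else 2.0) * integrand(j * h)
    return big_l ** (-k_i / 2.0) * total * h / 3.0


# Hypothetical example: n = 20 observations, one extra covariate, intercept-only null.
for q in (1.0, 0.9, 0.5):
    print(q, robust_bayes_factor(q, n=20, k_i=1, k_0=1))
```

When \(Q_{i0}\) is very close to zero the integrand becomes sharply peaked near \(v=0\), so a fixed grid loses accuracy; in that regime a library routine for \({}_2F_1\) (for example `scipy.special.hyp2f1`) or a finer grid is the safer choice.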