贝叶斯推导

Bayesian Inference

Let X1X1 be the vector of observable random variables. Let X2X2 be the vector of latent random variables. Let ΘΘ be the vector of parameters. f(x2,θ|x1)=f(x1|x2,θ)f(x2|θ)f(θ)f(x1)

极大似然估计Maximum Likelihood Estimation

假设有一堆独立同分布数据X1,…,Xn，其PDF为p(x;θ)，其中θ为模型参数，则其似然函数为: $$Ln(\theta)=\prod\limits{i=1}^n p(Xi; \theta)$$而极大似然估计就是要找到参数$\theta$，使得似然函数的值最大。这意思就是找到一个参数$\theta$，使得使用分布p(x;θ)来估计这一堆数据Xi的效果最好。为啥捏，因为假设X都是离散值的情况下，Ln(Xi;θ)表达的含义是：从参数θ通过模型p(x;θ)产生这一堆数据的概率(把所有单个数据产生的概率乘起来就是产生这一堆数据的概率)。所以p(x;θ)=Pθ(x={Xi})，那么如果当有两个参数θ1和θ2时，$P{\theta1}(x={X_i})>P{\theta2}(x={X_i})$，则说明θ1更好的描述了这组数据，因此要找到一个θ使得整似然函数的值最大！所以只要将似然函数对θ求导，就可以找到这样的θ。例子：求N次伯努利分布的最大似然估计: $$ Bern(x|\mu)=\mu^x(1-\mu)^{1-x} \ L(X|\mu)=\prod\limits{i=1}^N \mu^{X*i}(1-\mu)^{1-X_i}=\mu^S(1-\mu)^{N-S}其中S=\sum\limits*{i=1}^N X_i $$将$log L(X|\mu)$对$\mu$求导得到$\frac{S}{\mu}-\frac{N-S}{1-\mu}=0$，最终得到: $$ \hat{\mu}_N=\frac{1}{N} S=\bar{X}_N $$

Maximum A Posterior Estimation |极大后验估计

极大后验估计中加入了一些先验知识，它最大化的是一个后验函数。具体来说，因为贝叶斯定律:

$$ p(\theta|x)=\frac{p(x|\theta)p(\theta)}{p(x)} $$

那么极大后验估计就是要求

$$ \hat{\theta}{MAP}=\underset{\theta}{argmax}~ p(x|\theta)p(\theta)=\underset{\theta}{argmax}{\sum\limits{X_i} log ~p(X_i|\theta) + log~p(\theta)} $$

可见，极大后验估计中相对于最大似然估计，多了log p(θ)，也就是先验的影响。

贝叶斯推断:Bayesian Inference

前面的MAP是一个点估计，只估计似然函数达到最大点的情况下，参数θ的值。而贝叶斯推断通过假设关于参数$\theta$的一个分布而不是直接求单个值来扩展了MAP算法。

Bayesian inference extends the MAP approach by allowing a distribution over the parameter set θ instead of making a direct estimate. Not only encodes this the maximum(a posteriori) value of the data-generated parameters, but it also incorporates expectation as another parameter estimate as well as variance information as a measure of estimation quality or confidence.

具体来说，给定数据X和需要求的参数θ，贝叶斯推断需要求出一个具体的分布: $$ p(\theta|X)=P(X|\theta)P(\theta)/P(X) $$

$X$:观测数据
$\theta$:潜变量

这里和MAP的区别就在于，MAP忽略了P(X)因为它是常量，对于MAP的过程：求导后再求等于0来获得最好的θ，这个常量是没有用的。但是贝叶斯推断要的是整个p(θ|X)的分布，所以P(X)这个normalisation term是需要被求出来的。在获得具体的分布之后，所要求的参数值可以通过估计期望或方差得到。

最近更新于0001-01-01