请注意,文章中部分理解可能已被笔者废弃,本文最后更新于:6 个月前
一元函数下
对于一元函数 f ( x ) f(x) f ( x ) ,其泰勒二阶展开如下为
f ( x k + 1 ) = f ( x k ) + f ′ ( x k ) ( x k + 1 − x k ) + 1 2 f ′ ′ ( x k ) ( x k + 1 − x k ) 2 + O ( x 2 ) . f(x_{k+1})=f(x_k)+f'(x_k)(x_{k+1}-x_k)+\frac{1}{2}f''(x_k)(x_{k+1}-x_k)^2+O(x^2).
f ( x k + 1 ) = f ( x k ) + f ′ ( x k ) ( x k + 1 − x k ) + 2 1 f ′ ′ ( x k ) ( x k + 1 − x k ) 2 + O ( x 2 ) .
记 z = x k + 1 − x k z=x_{k+1}-x_k z = x k + 1 − x k 省略余项可表示为
f ( x k + 1 ) ≈ f ( x k ) + f ′ ( x k ) ⋅ z + 1 2 f ′ ′ ( x k ) ⋅ z 2 = g ( x ) . f(x_{k+1})\approx f(x_k)+f'(x_k)\cdot z+\frac{1}{2}f''(x_k)\cdot z^2=g(x).
f ( x k + 1 ) ≈ f ( x k ) + f ′ ( x k ) ⋅ z + 2 1 f ′ ′ ( x k ) ⋅ z 2 = g ( x ) .
若要求 f ( x ) f(x) f ( x ) 的零点 ,即想要 f ( x k + 1 ) = 0 f(x_{k+1})=0 f ( x k + 1 ) = 0 ,省略二阶展开易得
f ( x k ) + f ′ ( x k ) ⋅ z = 0 , f(x_k)+f'(x_k)\cdot z=0,
f ( x k ) + f ′ ( x k ) ⋅ z = 0 ,
z = − f ( x k ) f ′ ( x k ) , z=-\frac{f(x_k)}{f'(x_k)},
z = − f ′ ( x k ) f ( x k ) ,
x k + 1 = x k − f ( x k ) f ′ ( x k ) . x_{k+1}=x_k-\frac{f(x_k)}{f'(x_k)}.
x k + 1 = x k − f ′ ( x k ) f ( x k ) .
若要求 f ( x ) f(x) f ( x ) 的极值点 ,即想要 f ′ ( x k + 1 ) = 0 f'(x_{k+1})=0 f ′ ( x k + 1 ) = 0 ,也就是 g ′ ( z ) = 0 g'(z)=0 g ′ ( z ) = 0 ,有
g ′ ( z ) = f ′ ( x k ) + f ′ ′ ( x k ) ⋅ z = 0 , g'(z)=f'(x_k)+f''(x_k)\cdot z=0,
g ′ ( z ) = f ′ ( x k ) + f ′ ′ ( x k ) ⋅ z = 0 ,
z = − f ′ ( x k ) f ′ ′ ( x k ) , z=-\frac{f'(x_k)}{f''(x_k)},
z = − f ′ ′ ( x k ) f ′ ( x k ) ,
x k + 1 = x k − f ′ ( x k ) f ′ ′ ( x k ) . x_{k+1}=x_k-\frac{f'(x_k)}{f''(x_k)}.
x k + 1 = x k − f ′ ′ ( x k ) f ′ ( x k ) .
多元函数下
基本过程为:
二阶逼近
逼近式对增量求一阶导为零
求出增量计算格式
代回得到迭代式
二阶逼近为
f ( x k + z ) ≈ f ( x k ) + [ ∇ f ( x k ) ] T z + 1 2 z T [ ∇ 2 f ( x k ) ] z . f\left(\mathbf{x}_{k}+\mathbf{z}\right) \approx f\left(\mathbf{x}_{k}\right)+\left[\nabla f\left(\mathbf{x}_{k}\right)\right]^{T} \mathbf{z}+\frac{1}{2} \mathbf{z}^{T}\left[\nabla^{2} f\left(\mathbf{x}_{k}\right)\right] \mathbf{z}.
f ( x k + z ) ≈ f ( x k ) + [ ∇ f ( x k ) ] T z + 2 1 z T [ ∇ 2 f ( x k ) ] z .
其中 x = ( x 1 , x 2 ⋯ x n ) T \mathbf{x}=\left(x_1,x_2\cdots x_n\right)^T x = ( x 1 , x 2 ⋯ x n ) T ,z = ( z 1 , z 2 ⋯ z n ) T . \mathbf{z}=\left(z_1,z_2\cdots z_n\right)^T. z = ( z 1 , z 2 ⋯ z n ) T . 有
∇ f ( x k ) = ( ∂ f ( x ) ∂ x 1 ∂ f ( x ) ∂ x 2 ⋯ ∂ f ( x ) ∂ x n ) T . \nabla f\left(\mathbf{x}_{k}\right)=\left(\begin{array}{cccc}
\frac{\partial f(\mathbf{x})}{\partial x_{1}} &
\frac{\partial f(\mathbf{x})}{\partial x_{2}} &
\cdots &
\frac{\partial f(\mathbf{x})}{\partial x_{n}}
\end{array}\right)^T. ∇ f ( x k ) = ( ∂ x 1 ∂ f ( x ) ∂ x 2 ∂ f ( x ) ⋯ ∂ x n ∂ f ( x ) ) T .
又有
∇ 2 f = ∇ ⊗ ∇ f = [ ∂ 2 f ∂ 2 x 1 ∂ 2 f ∂ x 1 ∂ x 2 ⋯ ∂ 2 f ∂ x 1 ∂ x n ∂ 2 f ∂ x 2 ∂ x 1 ∂ 2 f ∂ x 2 ∂ x 2 ⋯ ∂ 2 f ∂ x 2 ∂ x n ⋮ ⋮ ⋱ ⋮ ∂ 2 f ∂ x n ∂ x 1 ⋯ … ∂ 2 f ∂ 2 x n ] \nabla^{2} \mathbf{f}=\nabla \otimes \nabla \mathbf{f}=\left[\begin{array}{cccc}
\frac{\partial^{2} f}{\partial^{2} x_{1}} & \frac{\partial^{2} f}{\partial x_{1} \partial x_{2}} & \cdots & \frac{\partial^{2} f}{\partial x_{1} \partial x_{n}} \\
\frac{\partial^{2} f}{\partial x_{2} \partial x_{1}} & \frac{\partial^{2} f}{\partial x_{2} \partial x_{2}} & \cdots & \frac{\partial^{2} f}{\partial x_{2} \partial x_{n}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^{2} f}{\partial x_{n} \partial x_{1}} & \cdots & \ldots & \frac{\partial^{2} f}{\partial^{2} x_{n}}
\end{array}\right] ∇ 2 f = ∇ ⊗ ∇ f = ⎣ ⎢ ⎢ ⎢ ⎢ ⎡ ∂ 2 x 1 ∂ 2 f ∂ x 2 ∂ x 1 ∂ 2 f ⋮ ∂ x n ∂ x 1 ∂ 2 f ∂ x 1 ∂ x 2 ∂ 2 f ∂ x 2 ∂ x 2 ∂ 2 f ⋮ ⋯ ⋯ ⋯ ⋱ … ∂ x 1 ∂ x n ∂ 2 f ∂ x 2 ∂ x n ∂ 2 f ⋮ ∂ 2 x n ∂ 2 f ⎦ ⎥ ⎥ ⎥ ⎥ ⎤
这就是Hessian矩阵,记为 H f ( x ) \mathbf{H}_{f}(\mathbf{x}) H f ( x ) .
令一阶导为 0 0 0 ,得到
∇ f ( x k ) + [ ∇ 2 f ( x k ) ] z = 0 , \nabla f\left(\mathbf{x}_{k}\right)+\left[\nabla^{2} f\left(\mathbf{x}_{k}\right)\right] \mathbf{z}=0,
∇ f ( x k ) + [ ∇ 2 f ( x k ) ] z = 0 ,
z = − H f − 1 ( x k ) ∇ f ( x k ) , \mathbf{z}=-\mathbf{H}_{f}^{-1}\left(\mathbf{x}_{k}\right) \nabla f\left(\mathbf{x}_{k}\right),
z = − H f − 1 ( x k ) ∇ f ( x k ) ,
x k + 1 = x k − H f − 1 ( x k ) ∇ f ( x k ) . \mathbf{x}_{k+1}=\mathbf{x}_{k}-\mathbf{H}_{f}^{-1}\left(\mathbf{x}_{k}\right) \nabla f\left(\mathbf{x}_{k}\right).
x k + 1 = x k − H f − 1 ( x k ) ∇ f ( x k ) .
因为这里Hessian矩阵不一定可逆,最简单直接的思路是在 H f − 1 ( x ) \mathbf{H}^{-1}_{f}(\mathbf{x}) H f − 1 ( x ) 的对角线上加上一个正数:
x k + 1 = x k − ( H f − 1 ( x k ) + μ I ) ∇ f ( x k ) . \mathbf{x}_{k+1}=\mathbf{x}_{k}-\left(\mathbf{H}_{f}^{-1}\left(\mathbf{x}_{k}\right)+\mu \mathbf{I}\right) \nabla f\left(\mathbf{x}_{k}\right).
x k + 1 = x k − ( H f − 1 ( x k ) + μ I ) ∇ f ( x k ) .
求最小二乘问题
记损失函数为
ℓ ( w ) = 1 2 ∥ f ( X , w ) − Y ∥ 2 = 1 2 e 2 . \ell(\mathbf{w})=\frac{1}{2}\|f(\mathbf{X} , \mathbf{w})-\mathbf{Y}\|^{2}=\frac{1}{2} \mathbf{e}^{2}.
ℓ ( w ) = 2 1 ∥ f ( X , w ) − Y ∥ 2 = 2 1 e 2 .
其中 w \mathbf{w} w 是全体参数组成的向量, e \mathbb{e} e 是误差值组成的向量。其二阶展开式如下
ℓ ( w k + z ) ≈ ℓ ( w k ) + ( ∇ ℓ ( w k ) ) T z + 1 2 z T ( ∇ 2 ℓ ( w k ) ) z . \ell\left(\mathbf{w}_{k}+\mathbf{z}\right) \approx \ell\left(\mathbf{w}_{k}\right)+\left(\nabla \ell\left(\mathbf{w}_{k}\right)\right)^{T} \mathbf{z}+\frac{1}{2} \mathbf{z}^{T}\left(\nabla^{2} \ell\left(\mathbf{w}_{k}\right)\right) \mathbf{z}.
ℓ ( w k + z ) ≈ ℓ ( w k ) + ( ∇ ℓ ( w k ) ) T z + 2 1 z T ( ∇ 2 ℓ ( w k ) ) z .
因为最小二乘法的特殊性,可以简化为
∇ ℓ ( w ) = ∂ ( 1 2 e 2 ) ∂ w = ( ∂ e ∂ w ) T e = J e T ( w ) e . \nabla \ell(\mathbf{w})=\frac{\partial\left(\frac{1}{2} \mathbf{e}^{2}\right)}{\partial \mathbf{w}}=\left(\frac{\partial \mathbf{e}}{\partial \mathbf{w}}\right)^{T} \mathbf{e}=\mathbf{J}_{\mathbf{e}}^{T}(\mathbf{w}) \mathbf{e}.
∇ ℓ ( w ) = ∂ w ∂ ( 2 1 e 2 ) = ( ∂ w ∂ e ) T e = J e T ( w ) e .
这就是Jacobian矩阵即为 J e ( w ) \mathbf{J}_{\mathbf{e}}^{}(\mathbf{w}) J e ( w ) .注意始终保证最后的结果是列向量。求二阶导为
∇ 2 ℓ ( w ) = ∇ ( ∇ ℓ ( w ) ) = ∂ ( ( ∂ e ∂ w ) T e ) ∂ w = ( ∂ e ∂ w ) T ( ∂ e ∂ w ) + ∂ 2 e ∂ w 2 e \nabla^{2} \ell(\mathbf{w})=\nabla(\nabla \ell(\mathbf{w}))=\frac{\partial\left(\left(\frac{\partial \mathbf{e}}{\partial \mathbf{w}}\right)^{T} \mathbf{e}\right)}{\partial \mathbf{w}}=\left(\frac{\partial \mathbf{e}}{\partial \mathbf{w}}\right)^{T}\left(\frac{\partial \mathbf{e}}{\partial \mathbf{w}}\right)+\frac{\partial^{2} \mathbf{e}}{\partial \mathbf{w}^{2}} \mathbf{e}
∇ 2 ℓ ( w ) = ∇ ( ∇ ℓ ( w ) ) = ∂ w ∂ ( ( ∂ w ∂ e ) T e ) = ( ∂ w ∂ e ) T ( ∂ w ∂ e ) + ∂ w 2 ∂ 2 e e
= J e T ( w ) J e ( w ) + ∂ 2 e ∂ w 2 e =\mathbf{J}_{\mathbf{e}}^{T}(\mathbf{w}) \mathbf{J}_{\mathbf{e}}(\mathbf{w})+\frac{\partial^{2} \mathbf{e}}{\partial \mathbf{w}^{2}} \mathbf{e}
= J e T ( w ) J e ( w ) + ∂ w 2 ∂ 2 e e
代回得到迭代式
w k + 1 = w k − ( ( J e T ( w ) J e ( w ) + ∂ 2 e ∂ w 2 e ) − 1 J e T ( w ) e ) ∣ w = w k \mathbf{w}_{k+1}=\mathbf{w}_{k}-\left.\left(\left(\mathbf{J}_{\mathbf{e}}^{T}(\mathbf{w}) \mathbf{J}_{\mathbf{e}}(\mathbf{w})+\frac{\partial^{2} \mathbf{e}}{\partial \mathbf{w}^{2}} \mathbf{e}\right)^{-1} \mathbf{J}_{\mathbf{e}}^{T}(\mathbf{w}) \mathbf{e}\right)\right|_{\mathbf{w}=\mathbf{w}_{k}}
w k + 1 = w k − ( ( J e T ( w ) J e ( w ) + ∂ w 2 ∂ 2 e e ) − 1 J e T ( w ) e ) ∣ ∣ ∣ ∣ ∣ w = w k
高斯-牛顿法
在文章 Theoria motus corporum coelestium in sectionibus conicis solem ambientum (1809), pp. 179-180. 中,高斯提到 ∇ 2 ℓ ( w ) \nabla^{2} \ell(\mathbf{w}) ∇ 2 ℓ ( w ) 的二阶项应该很小,所以直接省略,于是再对符号简化一下,这个Hessian近似就可以写成:
∇ 2 ℓ ( w ) = J T J . \nabla^{2} \ell(\mathbf{w})=\mathbf{J}^T\mathbf{J}.
∇ 2 ℓ ( w ) = J T J .
这个式子就是著名的 Gauss-Newton 法的迭代公式。但 J T J \mathbf{J}^T\mathbf{J} J T J 通常是半正定的,但不一定可逆的于是在对角线上加上一个正数,得到 Levenberg-Marquardt 算法。
w k + 1 = w k − ( J T J + μ I ) − 1 J T e \mathbf{w}_{k+1}=\mathbf{w}_{k}-\left(\mathbf{J}^{T} \mathbf{J}+\mu \mathbf{I}\right)^{-1} \mathbf{J}^{T} \mathbf{e}
w k + 1 = w k − ( J T J + μ I ) − 1 J T e
代码
一元函数
求核酸混检问题 中,
E ( x ) = 1 − ( 1 − p ) k − 1 k E(x)=1-(1-p)^k-\frac{1}{k}
E ( x ) = 1 − ( 1 − p ) k − k 1
的最小值点。
clc;clear;
p=0.01 ;
syms k;
E=@(k) 1 -(1 -p).^k+1. /k;
kDif=matlabFunction(diff(E,k,1 ));
kDiif=matlabFunction(diff(E,k,2 ));
x=[0.1 ];
while abs (kDif(x(end )))>1e-10
z=-kDif(x(end ))/kDiif(x(end ));
x(end +1 )=abs (x(end )+z);
end
fprintf('ans=%d\n' ,x(end ));
xx = -10 :0.1 :100 ;
yy=kDif(xx);
plot (xx,yy,'LineWidth' ,3 );
hold on;
for i =1 :size (x,2 )-1 ;
plot ([x(i ),x(i +1 )],[kDif(x(i )),0 ]);
plot ([x(i +1 ),x(i +1 )],[0 ,kDif(x(i +1 ))]);
end
axis([-10 ,100 ,-0.1 ,1 ]);
参考
本文几乎全部复制于知乎回答 ,仅作学习使用。