00 - 数学基础¶
机器学习的数学基石:从基础到前沿,从应用到理论
🎥 视频教程链接¶
中文视频教程¶
B站推荐¶
💡 以下为推荐的UP主和搜索关键词,请在B站直接搜索获取最新内容。
推荐UP主(在B站搜索其名称即可找到):

- 3Blue1Brown - 线性代数、微积分、概率论的直觉可视化
- 李沐 - 机器学习数学推导和论文精读
- StatQuest - 机器学习统计概念直观讲解(有中文字幕版)

推荐搜索关键词:

- "线性代数 本质"、"矩阵 直觉理解"
- "概率论 机器学习"、"最大似然估计 详解"
- "梯度下降 优化 直觉"、"反向传播 推导"
国内MOOC平台¶
💡 以下为推荐平台,请在平台内搜索相关课程名称。
英文视频教程¶
YouTube优质频道¶
- 3Blue1Brown - 数学可视化,直观理解
- StatQuest with Josh Starmer - 机器学习数学概念
- Khan Academy - 数学基础课程
- MIT OpenCourseWare - MIT数学课程
- PatrickJMT - 数学问题解答
Coursera课程¶
- Mathematics for Machine Learning Specialization - 机器学习数学专项
- Linear Algebra for Machine Learning - 线性代数
- Multivariable Calculus for Machine Learning - 多元微积分
Udemy课程¶
- Mathematics for Data Science and Machine Learning - 数据科学数学
- Linear Algebra for Machine Learning - 线性代数
- Calculus for Machine Learning - 微积分
edX课程¶
- MIT 18.06SC: Linear Algebra - MIT线性代数
- MIT 18.650: Statistics for Applications - MIT统计学
- Harvard CS109: Data Science - 哈佛数据科学
💻 在线练习平台¶
数学互动学习¶
- Khan Academy - 线性代数、微积分、概率论互动练习
- Brilliant - 数学和科学互动学习,有机器学习数学专题
- Coursera - 搜索"Mathematics for Machine Learning"等课程
- edX - MIT线性代数、概率统计等免费课程
数学计算与可视化工具¶
- Wolfram Alpha - 数学计算、公式验证、符号运算
- Desmos - 函数图像可视化,理解优化曲面
- GeoGebra - 几何与代数可视化,理解线性变换
- SymPy Live - Python符号数学在线环境
目录¶
- 线性代数
  - 基础概念
  - 矩阵微积分
  - 张量基础
  - 高级SVD应用
  - 谱图理论
- 概率与统计
  - 基础概率论
  - 测度论基础
  - 多元高斯分布
  - 指数族分布
  - 随机过程
  - 统计推断
- 微积分与优化
  - 微积分基础
  - 凸优化理论
  - 高级优化算法
  - 约束优化
- 信息论
  - 信息熵
  - 互信息与KL散度
  - 信息瓶颈理论
一、线性代数¶
1.1 基础概念¶
向量与矩阵¶
向量的几何意义:

- 向量不仅是数字的排列,更是有向线段,具有大小和方向
- 在机器学习中,向量表示特征空间中的点或方向

矩阵的几何意义:

- 矩阵表示线性变换:旋转、缩放、投影、剪切
- 每个矩阵都可以分解为基本的线性变换组合
import numpy as np
import matplotlib.pyplot as plt
# 可视化线性变换
A = np.array([[2, 1], [1, 2]]) # 对称矩阵
eigenvalues, eigenvectors = np.linalg.eig(A) # np.linalg线性代数运算
print(f"矩阵 A:\n{A}")
print(f"\n特征值: {eigenvalues}")
print(f"特征向量:\n{eigenvectors}")
# 几何解释:特征向量是变换中方向不变的向量
# 特征值表示在该方向上的缩放因子
特征值分解的深层理解¶
特征值分解的本质:

$$A = Q \Lambda Q^{-1}$$

其中:

- \(Q\):特征向量矩阵,表示新的坐标系
- \(\Lambda\):对角矩阵,表示在各特征方向上的缩放

物理意义:

- 特征向量定义了变换的主轴
- 特征值表示沿各主轴的伸缩比例
- 对称矩阵的特征向量构成正交基

应用:

- PCA:数据的主成分就是协方差矩阵的特征向量
- PageRank:网页重要性是链接矩阵的主特征向量
- 振动分析:系统的固有频率和模态
# PCA的数学本质:特征值分解
def pca_eig(X, n_components=2):
"""基于特征值分解的PCA"""
# 中心化
X_centered = X - np.mean(X, axis=0)
# 协方差矩阵
cov_matrix = np.cov(X_centered.T)
# 特征值分解(协方差矩阵是对称矩阵,用eigh更稳定且保证实数结果)
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
# 按特征值排序
idx = eigenvalues.argsort()[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]
# 投影
components = eigenvectors[:, :n_components]
X_transformed = X_centered @ components
return X_transformed, eigenvalues, eigenvectors
奇异值分解(SVD)的深层理解¶
SVD的本质:

$$A = U \Sigma V^T$$

几何解释:

1. \(V^T\):在输入空间旋转到标准正交基
2. \(\Sigma\):在各坐标轴上缩放
3. \(U\):在输出空间旋转

SVD与特征值分解的关系:

- \(A^TA = V \Sigma^2 V^T\)(右奇异向量是\(A^TA\)的特征向量)
- \(AA^T = U \Sigma^2 U^T\)(左奇异向量是\(AA^T\)的特征向量)

应用:

- 低秩近似:保留前k个奇异值
- 伪逆:\(A^+ = V \Sigma^+ U^T\)
- 推荐系统:矩阵分解
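SVD与特征值分解的关系可以用几行代码做数值抽查(示意验证,随机矩阵仅作演示):

```python
import numpy as np

np.random.seed(0)
A = np.random.randn(5, 3)

# SVD: A = U Sigma V^T
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# A^T A 的特征值分解(对称矩阵用 eigh)
eigvals, eigvecs = np.linalg.eigh(A.T @ A)
eigvals = eigvals[::-1]  # eigh按升序返回,翻转为降序与奇异值对齐

# 验证1:A^T A 的特征值 = 奇异值的平方
print("A^T A 的特征值:", eigvals)
print("奇异值的平方:  ", s**2)

# 验证2:直接验证 A^T A = V Sigma^2 V^T
assert np.allclose(A.T @ A, Vt.T @ np.diag(s**2) @ Vt)
```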
# SVD低秩近似
def low_rank_approximation(A, k):
"""SVD低秩近似"""
U, s, Vt = np.linalg.svd(A, full_matrices=False)
# 保留前k个奇异值
U_k = U[:, :k]
s_k = s[:k]
Vt_k = Vt[:k, :]
# 重构
A_approx = U_k @ np.diag(s_k) @ Vt_k
# 计算近似误差
error = np.linalg.norm(A - A_approx, 'fro') / np.linalg.norm(A, 'fro')
return A_approx, error, s
# 示例:图像压缩
from sklearn.datasets import load_sample_image
china = load_sample_image("china.jpg")
gray = np.mean(china, axis=2)
k = 50
A_approx, error, singular_values = low_rank_approximation(gray, k=k)
# 秩k近似需存储 k*(m+n+1) 个数,原图需存储 m*n 个数
print(f"压缩率: {k * (gray.shape[0] + gray.shape[1] + 1) / (gray.shape[0] * gray.shape[1]):.2%}")
print(f"近似误差: {error:.4f}")
1.2 矩阵微积分¶
向量求导基础¶
基本定义:

- 标量对向量求导:\(\frac{\partial f}{\partial \mathbf{x}} = \left[\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, ..., \frac{\partial f}{\partial x_n}\right]^T\)
- 向量对向量求导(Jacobian矩阵)
- 标量对矩阵求导
常用公式:
| 表达式 | 结果 | 条件 |
|---|---|---|
| \(\frac{\partial (\mathbf{a}^T\mathbf{x})}{\partial \mathbf{x}}\) | \(\mathbf{a}\) | |
| \(\frac{\partial (\mathbf{x}^T\mathbf{A}\mathbf{x})}{\partial \mathbf{x}}\) | \((\mathbf{A} + \mathbf{A}^T)\mathbf{x}\) | |
| \(\frac{\partial (\mathbf{A}\mathbf{x})}{\partial \mathbf{x}}\) | \(\mathbf{A}^T\) | |
| \(\frac{\partial (\mathbf{x}^T\mathbf{A})}{\partial \mathbf{x}}\) | \(\mathbf{A}\) | |
| \(\frac{\partial \lVert\mathbf{x}\rVert^2}{\partial \mathbf{x}}\) | \(2\mathbf{x}\) | |
| \(\frac{\partial (\mathbf{x}^T\mathbf{x})}{\partial \mathbf{x}}\) | \(2\mathbf{x}\) | |
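表中的公式可以用数值梯度抽查,例如 \(\frac{\partial (\mathbf{x}^T A \mathbf{x})}{\partial \mathbf{x}} = (A + A^T)\mathbf{x}\)(示意代码,矩阵与向量为随机示例):

```python
import numpy as np

np.random.seed(1)
n = 4
A = np.random.randn(n, n)   # 一般(非对称)矩阵
x = np.random.randn(n)

f = lambda v: v @ A @ v     # f(x) = x^T A x

# 解析梯度:(A + A^T) x
analytical = (A + A.T) @ x

# 中心差分数值梯度
eps = 1e-6
numerical = np.array([
    (f(x + eps * e) - f(x - eps * e)) / (2 * eps)
    for e in np.eye(n)
])

print("解析梯度:", analytical)
print("数值梯度:", numerical)
```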
链式法则的矩阵形式:
若 \(\mathbf{y} = f(\mathbf{g}(\mathbf{x}))\),则(以Jacobian矩阵形式):

$$\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \frac{\partial \mathbf{y}}{\partial \mathbf{g}} \frac{\partial \mathbf{g}}{\partial \mathbf{x}}$$
# 矩阵求导在机器学习中的应用:线性回归的闭式解
def linear_regression_closed_form(X, y):
"""
最小二乘法的闭式解
目标:min ||Xw - y||^2
求导:2X^T(Xw - y) = 0
解得:w = (X^T X)^(-1) X^T y
"""
# 用 solve 代替显式求逆,数值上更稳定
return np.linalg.solve(X.T @ X, X.T @ y)
# 梯度推导验证
def verify_gradient():
"""验证梯度计算"""
X = np.random.randn(100, 5)
w_true = np.random.randn(5)
y = X @ w_true + 0.1 * np.random.randn(100)
# 数值梯度
def loss(w):
return np.sum((X @ w - y) ** 2)
def numerical_gradient(w, eps=1e-5):
grad = np.zeros_like(w)
for i in range(len(w)):
w_plus = w.copy()
w_plus[i] += eps
w_minus = w.copy()
w_minus[i] -= eps
grad[i] = (loss(w_plus) - loss(w_minus)) / (2 * eps)
return grad
w = np.random.randn(5)
numerical_grad = numerical_gradient(w)
analytical_grad = 2 * X.T @ (X @ w - y)
print(f"数值梯度: {numerical_grad}")
print(f"解析梯度: {analytical_grad}")
print(f"误差: {np.linalg.norm(numerical_grad - analytical_grad)}")
verify_gradient()
二阶导数与Hessian矩阵¶
Hessian矩阵:

$$H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$$

性质:

- 对称矩阵(当二阶偏导连续时)
- 正定 → 局部最小值
- 负定 → 局部最大值
- 不定 → 鞍点

应用:

- 牛顿法:利用Hessian矩阵加速收敛
- 曲率分析:理解损失函数的几何形状
# Hessian矩阵在优化中的应用
def compute_hessian(f, x, eps=1e-5):
"""数值计算Hessian矩阵"""
n = len(x)
hessian = np.zeros((n, n))
for i in range(n):
for j in range(n):
x_pp = x.copy()
x_pp[i] += eps
x_pp[j] += eps
x_pm = x.copy()
x_pm[i] += eps
x_pm[j] -= eps
x_mp = x.copy()
x_mp[i] -= eps
x_mp[j] += eps
x_mm = x.copy()
x_mm[i] -= eps
x_mm[j] -= eps
hessian[i, j] = (f(x_pp) - f(x_pm) - f(x_mp) + f(x_mm)) / (4 * eps**2)
return hessian
# 示例:二次函数的Hessian
f = lambda x: x[0]**2 + 2*x[1]**2 + 3*x[0]*x[1]
x = np.array([1.0, 1.0])
hessian = compute_hessian(f, x)
print(f"Hessian矩阵:\n{hessian}")
print(f"特征值: {np.linalg.eigvals(hessian)}")
1.3 张量基础¶
张量的定义与运算¶
张量的阶(Rank):

- 0阶:标量
- 1阶:向量
- 2阶:矩阵
- 3阶及以上:高阶张量

张量运算:

- 张量积(Outer Product):\(\mathbf{a} \otimes \mathbf{b} \otimes \mathbf{c}\)
- 缩并(Contraction):对指定维度求和
- 张量网络:用图表示张量运算
import numpy as np
# 张量基础操作
A = np.random.randn(3, 4, 5) # 3阶张量
B = np.random.randn(4, 5, 6) # 3阶张量
# 张量缩并(在维度1和2上)
C = np.tensordot(A, B, axes=([1, 2], [0, 1]))
print(f"A形状: {A.shape}, B形状: {B.shape}")
print(f"缩并后C形状: {C.shape}")
# Einstein求和约定
def batch_matrix_multiply(A, B):
"""批量矩阵乘法: (batch, m, n) @ (batch, n, p) -> (batch, m, p)"""
return np.einsum('bmn,bnp->bmp', A, B)
A_batch = np.random.randn(10, 3, 4)
B_batch = np.random.randn(10, 4, 5)
C_batch = batch_matrix_multiply(A_batch, B_batch)
print(f"批量矩阵乘法结果形状: {C_batch.shape}")
张量分解¶
CP分解(CANDECOMP/PARAFAC):

$$\mathcal{X} \approx \sum_{r=1}^{R} \mathbf{a}_r \otimes \mathbf{b}_r \otimes \mathbf{c}_r$$

Tucker分解:

$$\mathcal{X} \approx \mathcal{G} \times_1 A \times_2 B \times_3 C$$

应用:

- 高维数据压缩
- 推荐系统(张量补全)
- 神经网络的低秩近似
💡 安装依赖:运行以下代码前需要先安装tensorly库(`pip install tensorly`)。
from tensorly.decomposition import parafac, tucker
import tensorly as tl
# CP分解
X = np.random.randn(10, 20, 30)
factors = parafac(X, rank=5)
print("CP分解因子形状:")
for i, factor in enumerate(factors.factors): # enumerate同时获取索引和元素
print(f" 因子{i}: {factor.shape}")
# Tucker分解
core, factors = tucker(X, rank=[5, 10, 15])  # 新版tensorly中参数名为rank
print(f"\nTucker分解核心张量形状: {core.shape}")
for i, factor in enumerate(factors):
print(f" 因子矩阵{i}: {factor.shape}")
1.4 高级SVD应用¶
矩阵补全(Matrix Completion)¶
问题设置:

- 已知矩阵的部分元素
- 目标是恢复完整矩阵
- 假设矩阵是低秩的

算法:

- 奇异值阈值(SVT)
- 交替最小二乘(ALS)
def matrix_completion_svt(M, mask, rank=5, max_iter=100, tau=1.0):
"""
奇异值阈值算法进行矩阵补全
M: 观测矩阵(含缺失值设为0)
mask: 观测掩码(1表示观测到,0表示缺失)
"""
X = M.copy()
for iteration in range(max_iter):
# SVD
U, s, Vt = np.linalg.svd(X, full_matrices=False)
# 软阈值(Soft Thresholding)
s_threshold = np.maximum(s - tau, 0)
# 重构
X_new = U @ np.diag(s_threshold) @ Vt
# 保持观测值不变
X = mask * M + (1 - mask) * X_new
# 检查收敛
if iteration % 10 == 0:
error = np.linalg.norm(mask * (X - M), 'fro')
print(f"迭代 {iteration}: 误差 = {error:.6f}")
return X
# 示例
M_true = np.random.randn(50, 5) @ np.random.randn(5, 60) # 秩5的低秩矩阵(低秩假设是补全成立的前提)
mask = np.random.rand(50, 60) > 0.7 # 70%缺失
M_observed = M_true * mask
M_completed = matrix_completion_svt(M_observed, mask, rank=10)
error = np.linalg.norm(M_completed - M_true, 'fro') / np.linalg.norm(M_true, 'fro')
print(f"\n补全相对误差: {error:.4f}")
鲁棒PCA¶
问题:将矩阵分解为低秩部分和稀疏部分

$$\min_{L, S} \|L\|_* + \lambda \|S\|_1 \quad \text{s.t.} \quad L + S = M$$

其中\(L\)是低秩的,\(S\)是稀疏的(异常值),\(\|\cdot\|_*\)为核范数(奇异值之和)
def robust_pca(M, lambda_=0.01, max_iter=100):
"""
鲁棒PCA(Principal Component Pursuit)
"""
L = np.zeros_like(M)
S = np.zeros_like(M)
Y = np.zeros_like(M)
mu = 1.25 / np.linalg.norm(M, 2)
for iteration in range(max_iter):
# 更新L(低秩部分)
U, s, Vt = np.linalg.svd(M - S + Y/mu, full_matrices=False)
s_threshold = np.maximum(s - 1/mu, 0)
L = U @ np.diag(s_threshold) @ Vt
# 更新S(稀疏部分)
temp = M - L + Y/mu
S = np.sign(temp) * np.maximum(np.abs(temp) - lambda_/mu, 0)
# 更新对偶变量
Y = Y + mu * (M - L - S)
# 增大mu
mu = min(mu * 1.5, 1e6)
if iteration % 10 == 0:
error = np.linalg.norm(M - L - S, 'fro') / np.linalg.norm(M, 'fro')
print(f"迭代 {iteration}: 相对误差 = {error:.6f}")
return L, S
# 示例:从含稀疏异常值的观测中恢复低秩结构
M_clean = np.random.randn(100, 5) @ np.random.randn(5, 100) # 秩5的低秩矩阵
noise = np.random.randn(100, 100) * 0.1
# 添加稀疏异常值
outliers = np.random.rand(100, 100) < 0.05
noise[outliers] = np.random.randn(np.sum(outliers)) * 5
M_noisy = M_clean + noise
L, S = robust_pca(M_noisy)
print(f"\n低秩部分秩: {np.linalg.matrix_rank(L, tol=1e-3)}")
print(f"稀疏部分非零元素比例: {np.mean(S != 0):.2%}")
1.5 谱图理论¶
图拉普拉斯矩阵¶
定义:

- 邻接矩阵\(A\):\(A_{ij} = 1\)如果节点\(i\)和\(j\)相连
- 度矩阵\(D\):\(D_{ii} = \sum_j A_{ij}\)
- 拉普拉斯矩阵:\(L = D - A\)

归一化拉普拉斯:

- 对称归一化:\(L_{sym} = D^{-1/2} L D^{-1/2} = I - D^{-1/2} A D^{-1/2}\)
- 随机游走归一化:\(L_{rw} = D^{-1} L = I - D^{-1} A\)

性质:

- \(L\)是半正定的
- 最小特征值为0,对应特征向量为全1向量
- 特征值0的重数等于图的连通分量数
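"特征值0的重数等于连通分量数"这条性质可以用一个小例子直接验证(示意代码,构造两个互不连通的三角形):

```python
import numpy as np

# 两个互不连通的三角形(6个节点,2个连通分量)
triangle = np.array([[0, 1, 1],
                     [1, 0, 1],
                     [1, 1, 0]])
A = np.block([[triangle, np.zeros((3, 3))],
              [np.zeros((3, 3)), triangle]])

# 未归一化拉普拉斯 L = D - A
D = np.diag(A.sum(axis=1))
L = D - A

eigenvalues = np.linalg.eigvalsh(L)  # 对称矩阵,升序返回
n_zero = np.sum(np.isclose(eigenvalues, 0))
print(f"特征值: {np.round(eigenvalues, 6)}")
print(f"零特征值重数: {n_zero}(等于连通分量数 2)")
```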
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh
def compute_graph_laplacian(adj_matrix, norm_type='sym'):
"""
计算图拉普拉斯矩阵
adj_matrix: 邻接矩阵(可以是稀疏矩阵)
norm_type: 'sym'(对称归一化)或'rw'(随机游走)
"""
# 度矩阵
degrees = np.array(adj_matrix.sum(axis=1)).flatten()
D = sp.diags(degrees)
# 未归一化拉普拉斯
L = D - adj_matrix
if norm_type == 'sym':
# 对称归一化
D_inv_sqrt = sp.diags(1.0 / np.sqrt(degrees + 1e-10))
L_norm = D_inv_sqrt @ L @ D_inv_sqrt
elif norm_type == 'rw':
# 随机游走归一化
D_inv = sp.diags(1.0 / (degrees + 1e-10))
L_norm = D_inv @ L
else:
L_norm = L
return L_norm
# 示例:圆环图的拉普拉斯
n = 20
adj = sp.diags([np.ones(n-1), np.ones(n-1), np.ones(1), np.ones(1)],
offsets=[-1, 1, 1-n, n-1], shape=(n, n)) # 偏移±(n-1)的对角线长度为1(环的首尾相连)
L = compute_graph_laplacian(adj, norm_type='sym')
# 计算特征值
eigenvalues, eigenvectors = eigsh(L, k=5, which='SM')
print(f"最小的5个特征值: {eigenvalues}")
图傅里叶变换¶
定义:

- 经典傅里叶变换:将函数分解为正弦波的叠加
- 图傅里叶变换:将图信号分解为拉普拉斯特征向量的叠加

变换公式:

- 正变换:\(\hat{f}(\lambda_l) = \sum_{i=1}^N f(i) u_l(i)\)
- 逆变换:\(f(i) = \sum_{l=0}^{N-1} \hat{f}(\lambda_l) u_l(i)\)

应用:

- 图卷积网络(GCN)的理论基础
- 图信号处理
- 谱聚类
def graph_fourier_transform(signal, eigenvectors):
"""
图傅里叶变换
signal: 图信号 (N,)
eigenvectors: 拉普拉斯矩阵的特征向量 (N, N)
"""
return eigenvectors.T @ signal
def inverse_graph_fourier_transform(fourier_coeffs, eigenvectors):
"""
逆图傅里叶变换
"""
return eigenvectors @ fourier_coeffs
# 示例:在图上平滑信号
n = 100
# 创建路径图
adj = sp.diags([np.ones(n-1), np.ones(n-1)], offsets=[-1, 1])
L = compute_graph_laplacian(adj, norm_type='none')
# 小图直接做完整特征分解:特征值升序排列,已包含0特征值及对应的常数特征向量
eigenvalues, eigenvectors = np.linalg.eigh(L.toarray())
# 创建带噪声的信号
signal_clean = np.sin(2 * np.pi * np.arange(n) / n * 3)
signal_noisy = signal_clean + 0.5 * np.random.randn(n)
# 图傅里叶变换
fourier_coeffs = graph_fourier_transform(signal_noisy, eigenvectors)
# 低通滤波:保留低频成分
k = 10 # 保留前k个频率
fourier_filtered = fourier_coeffs.copy()
fourier_filtered[k:] = 0
# 逆变换
signal_filtered = inverse_graph_fourier_transform(fourier_filtered, eigenvectors)
print(f"去噪前MSE: {np.mean((signal_noisy - signal_clean)**2):.4f}")
print(f"去噪后MSE: {np.mean((signal_filtered - signal_clean)**2):.4f}")
二、概率与统计¶
2.1 基础概率论¶
概率空间与随机变量¶
概率空间的三要素:\((\Omega, \mathcal{F}, P)\)

- \(\Omega\):样本空间(所有可能结果的集合)
- \(\mathcal{F}\):事件空间(\(\Omega\)的子集的集合)
- \(P\):概率测度

随机变量的类型:

- 离散型:概率质量函数(PMF)\(P(X = x)\)
- 连续型:概率密度函数(PDF)\(f(x)\),\(P(a < X < b) = \int_a^b f(x) dx\)

期望与方差:

- 期望:\(E[X] = \int x f(x) dx\)(连续)或 \(\sum x P(X=x)\)(离散)
- 方差:\(Var(X) = E[(X - E[X])^2] = E[X^2] - (E[X])^2\)
from scipy import stats
import numpy as np
# 常见分布
# 1. 正态分布
normal_dist = stats.norm(loc=0, scale=1)
print(f"正态分布期望: {normal_dist.mean()}")
print(f"正态分布方差: {normal_dist.var()}")
# 2. 二项分布
binomial_dist = stats.binom(n=10, p=0.3)
print(f"\n二项分布期望: {binomial_dist.mean()}")
print(f"二项分布方差: {binomial_dist.var()}")
# 3. 泊松分布
poisson_dist = stats.poisson(mu=3)
print(f"\n泊松分布期望: {poisson_dist.mean()}")
print(f"泊松分布方差: {poisson_dist.var()}")
# 大数定律验证
sample_sizes = [10, 100, 1000, 10000, 100000]
for n in sample_sizes:
samples = np.random.randn(n)
sample_mean = np.mean(samples)
print(f"样本量 {n:6d}: 样本均值 = {sample_mean:.6f}")
条件概率与贝叶斯定理¶
条件概率:\(P(A|B) = \frac{P(A \cap B)}{P(B)}\),其中\(P(B) > 0\)
贝叶斯定理:

$$P(H|E) = \frac{P(E|H) P(H)}{P(E)}$$

其中:

- \(P(H)\):先验概率
- \(P(E|H)\):似然
- \(P(H|E)\):后验概率
- \(P(E)\):证据(边际似然)
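一个常见的数值例子是疾病检测(以下数值为假设的示例数据),直接套用贝叶斯公式即可:

```python
# 假设数据(仅作示例):患病率1%,灵敏度99%,假阳性率5%
p_disease = 0.01             # 先验 P(H)
p_pos_given_disease = 0.99   # 似然 P(E|H)
p_pos_given_healthy = 0.05   # P(E|非H)

# 证据 P(E):全概率公式
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# 后验 P(H|E):即使检测呈阳性,患病概率也只有约1/6
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(f"检测阳性后患病概率: {p_disease_given_pos:.4f}")
```

先验很小时,即便似然很高,后验仍可能不大,这正是贝叶斯定理想强调的直觉。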
应用:朴素贝叶斯分类器
# 贝叶斯定理应用:垃圾邮件分类
class NaiveBayesClassifier:
"""
多项式朴素贝叶斯(用于文本分类)
"""
def __init__(self, alpha=1.0): # __init__构造方法,创建对象时自动调用
self.alpha = alpha # 拉普拉斯平滑参数
self.class_priors = {}
self.word_probs = {}
self.vocab = set()
def fit(self, X, y):
"""
X: 文档列表,每个文档是词列表
y: 类别标签
"""
n_samples = len(X)
classes = np.unique(y)
# 构建词汇表
for doc in X:
self.vocab.update(doc)
self.vocab = list(self.vocab)
word_to_idx = {word: i for i, word in enumerate(self.vocab)}
for c in classes:
# 计算先验概率 P(y=c)
class_docs = [X[i] for i in range(n_samples) if y[i] == c]
self.class_priors[c] = len(class_docs) / n_samples
# 计算词频
word_counts = np.zeros(len(self.vocab))
for doc in class_docs:
for word in doc:
word_counts[word_to_idx[word]] += 1
# 计算条件概率 P(word|y=c)(带平滑)
total_words = word_counts.sum()
self.word_probs[c] = (word_counts + self.alpha) / (total_words + self.alpha * len(self.vocab))
self.word_to_idx = word_to_idx
return self
def predict(self, X):
predictions = []
for doc in X:
scores = {}
for c in self.class_priors:
# log P(y=c) + sum(log P(word|y=c))
score = np.log(self.class_priors[c])
for word in doc:
if word in self.word_to_idx:
score += np.log(self.word_probs[c][self.word_to_idx[word]])
scores[c] = score
predictions.append(max(scores, key=scores.get))
return predictions
# 示例
docs = [
['free', 'win', 'money', 'now'],
['meeting', 'schedule', 'tomorrow'],
['win', 'prize', 'free'],
['project', 'deadline', 'meeting']
]
labels = ['spam', 'ham', 'spam', 'ham']
clf = NaiveBayesClassifier()
clf.fit(docs, labels)
test_doc = [['free', 'win']]
prediction = clf.predict(test_doc)
print(f"测试文档分类结果: {prediction[0]}")
2.2 测度论基础(概念性介绍)¶
为什么要学测度论?¶
测度论解决的问题:

1. 不可数集上的概率:连续随机变量的严格定义
2. 复杂事件的概率:极限、积分、期望的严格定义
3. 大数定律和中心极限定理的严格证明

核心概念:

- σ-代数(σ-algebra):可测事件的集合
- 测度(Measure):集合的"大小"(长度、面积、体积的推广)
- 可测函数:可以积分的函数
- 勒贝格积分:比黎曼积分更一般的积分

与机器学习的联系:

- 泛化误差的严格定义
- 经验风险最小化的理论基础
- 随机过程的严格处理
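测度论观点下,事件概率就是指示函数的期望 \(P(X \in A) = E[\mathbf{1}_A(X)]\)。下面的示意代码用蒙特卡洛近似这个积分,并与scipy的解析CDF对比(区间和样本量为随意选取的示例):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
samples = rng.standard_normal(1_000_000)

# P(0 < X < 1) 作为指示函数的期望(对概率测度积分的蒙特卡洛近似)
p_mc = np.mean((samples > 0) & (samples < 1))

# 解析值:Phi(1) - Phi(0)
p_exact = norm.cdf(1) - norm.cdf(0)

print(f"蒙特卡洛估计: {p_mc:.4f}")
print(f"解析值:       {p_exact:.4f}")
```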
2.3 多元高斯分布¶
定义与性质¶
概率密度函数:

$$f(\mathbf{x}) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)$$

几何解释:

- 均值\(\boldsymbol{\mu}\):分布的中心
- 协方差\(\Sigma\):决定分布的形状和方向
- 等高线:椭圆(马氏距离为常数的点)

重要性质:

1. 边缘分布仍是高斯分布
2. 条件分布仍是高斯分布
3. 线性变换后仍是高斯分布
4. 独立 ↔ 不相关(高斯分布特有的性质)
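性质3(线性变换后仍是高斯)可以用采样做经验验证(示意代码,\(A\)、\(\mathbf{b}\)为随意选取的示例):

```python
import numpy as np

rng = np.random.default_rng(42)
mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])

X = rng.multivariate_normal(mu, Sigma, size=200_000)

# 线性变换 y = A x + b
A = np.array([[1.0, 1.0], [0.0, 2.0]])
b = np.array([0.5, -1.0])
Y = X @ A.T + b

# 理论:y ~ N(A mu + b, A Sigma A^T)
mu_theory = A @ mu + b
Sigma_theory = A @ Sigma @ A.T

print("经验均值:", Y.mean(axis=0), " 理论:", mu_theory)
print("经验协方差:\n", np.cov(Y.T), "\n理论:\n", Sigma_theory)
```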
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal
# 多元高斯分布可视化
def plot_gaussian_2d(mean, cov, title):
x, y = np.mgrid[-3:3:.01, -3:3:.01]
pos = np.dstack((x, y))
rv = multivariate_normal(mean, cov)
plt.figure(figsize=(8, 6))
plt.contourf(x, y, rv.pdf(pos), levels=20, cmap='viridis')
plt.colorbar(label='Probability Density')
plt.title(title)
plt.xlabel('x1')
plt.ylabel('x2')
# 绘制特征向量
eigenvalues, eigenvectors = np.linalg.eig(cov)
for i in range(2):
vec = eigenvectors[:, i] * np.sqrt(eigenvalues[i])
plt.arrow(mean[0], mean[1], vec[0], vec[1],
head_width=0.1, head_length=0.1, fc='red', ec='red')
plt.show()
# 不同协方差矩阵的形状
plot_gaussian_2d([0, 0], [[1, 0], [0, 1]], '各向同性(球形)')
plot_gaussian_2d([0, 0], [[2, 0], [0, 0.5]], '轴对齐椭圆')
plot_gaussian_2d([0, 0], [[1, 0.8], [0.8, 1]], '旋转椭圆(正相关)')
plot_gaussian_2d([0, 0], [[1, -0.8], [-0.8, 1]], '旋转椭圆(负相关)')
条件分布与边缘分布¶
条件分布:

若 \(\mathbf{x} = \begin{bmatrix} \mathbf{x}_1 \\ \mathbf{x}_2 \end{bmatrix}\),则:

$$\mathbf{x}_1 \mid \mathbf{x}_2 = \mathbf{a} \sim \mathcal{N}(\boldsymbol{\mu}_{1|2}, \Sigma_{1|2})$$

其中:

- \(\boldsymbol{\mu}_{1|2} = \boldsymbol{\mu}_1 + \Sigma_{12} \Sigma_{22}^{-1} (\mathbf{a} - \boldsymbol{\mu}_2)\)
- \(\Sigma_{1|2} = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}\)
应用:高斯过程回归、卡尔曼滤波
def conditional_gaussian(mu, Sigma, x2_val, idx1, idx2):
"""
计算多元高斯的条件分布
mu: 均值向量
Sigma: 协方差矩阵
x2_val: 观测到的x2值
idx1: 需要推断的变量索引
idx2: 观测变量的索引
"""
mu1 = mu[idx1]
mu2 = mu[idx2]
Sigma11 = Sigma[np.ix_(idx1, idx1)]
Sigma12 = Sigma[np.ix_(idx1, idx2)]
Sigma21 = Sigma[np.ix_(idx2, idx1)]
Sigma22 = Sigma[np.ix_(idx2, idx2)]
# 条件均值
mu_cond = mu1 + Sigma12 @ np.linalg.inv(Sigma22) @ (x2_val - mu2)
# 条件协方差
Sigma_cond = Sigma11 - Sigma12 @ np.linalg.inv(Sigma22) @ Sigma21
return mu_cond, Sigma_cond
# 示例
mu = np.array([0, 0])
Sigma = np.array([[1, 0.8], [0.8, 1]])
# 已知 x2 = 1.5,求 x1 的条件分布
x2_val = np.array([1.5])
mu_cond, Sigma_cond = conditional_gaussian(mu, Sigma, x2_val, [0], [1])
print(f"条件分布: x1 | x2=1.5 ~ N({mu_cond[0]:.3f}, {Sigma_cond[0,0]:.3f})")
print(f"先验分布: x1 ~ N({mu[0]:.3f}, {Sigma[0,0]:.3f})")
print(f"观测x2后,x1的不确定性从{Sigma[0,0]:.3f}降低到{Sigma_cond[0,0]:.3f}")
2.4 指数族分布¶
统一理论框架¶
标准形式:

$$p(x \mid \boldsymbol{\theta}) = h(x) \exp\left(\boldsymbol{\eta}(\boldsymbol{\theta})^T \mathbf{T}(x) - A(\boldsymbol{\theta})\right)$$

其中:

- \(\boldsymbol{\eta}(\boldsymbol{\theta})\):自然参数
- \(\mathbf{T}(x)\):充分统计量
- \(A(\boldsymbol{\theta})\):对数配分函数(保证归一化)
- \(h(x)\):基础测度
常见分布的指数族形式:
| 分布 | 自然参数 | 充分统计量 | 对数配分函数 |
|---|---|---|---|
| 正态分布 | \([\mu/\sigma^2, -1/(2\sigma^2)]\) | \([x, x^2]\) | \(\mu^2/(2\sigma^2) + \log\sigma\) |
| 伯努利分布 | \(\log(p/(1-p))\) | \(x\) | \(\log(1+e^\eta)\) |
| 泊松分布 | \(\log\lambda\) | \(x\) | \(e^\eta\) |
| 指数分布 | \(-\lambda\) | \(x\) | \(-\log(-\eta)\) |
重要性质:

1. \(E[T(X)] = \nabla_\eta A(\eta)\)
2. \(Var(T(X)) = \nabla^2_\eta A(\eta)\)

应用:

- 广义线性模型(GLM)
- 变分推断
- 自然梯度下降
# 指数族分布的统一视角
import numpy as np
from scipy.special import logsumexp
class ExponentialFamily:
"""
指数族分布的基类
"""
def __init__(self, natural_params):
self.eta = natural_params
def log_partition(self):
"""对数配分函数 A(eta)"""
raise NotImplementedError
def sufficient_stats(self, x):
"""充分统计量 T(x)"""
raise NotImplementedError
def log_pdf(self, x):
"""对数概率密度"""
return np.dot(self.eta, self.sufficient_stats(x)) - self.log_partition() # np.dot计算点积/矩阵乘法
def mean(self):
"""E[T(X)] = dA/deta"""
raise NotImplementedError
def variance(self):
"""Var(T(X)) = d^2A/deta^2"""
raise NotImplementedError
class GaussianEF(ExponentialFamily):
"""正态分布的指数族形式"""
def __init__(self, mu, sigma2):
# 自然参数: [mu/sigma^2, -1/(2*sigma^2)]
eta = np.array([mu / sigma2, -0.5 / sigma2])
super().__init__(eta) # super()调用父类方法
self.mu = mu
self.sigma2 = sigma2
def log_partition(self):
eta1, eta2 = self.eta
return -eta1**2 / (4*eta2) - 0.5 * np.log(-2*eta2)
def sufficient_stats(self, x):
return np.array([x, x**2])
def mean(self):
return np.array([self.mu, self.sigma2 + self.mu**2])
def variance(self):
return np.array([[self.sigma2, 2*self.mu*self.sigma2],
[2*self.mu*self.sigma2, 2*self.sigma2**2 + 4*self.mu**2*self.sigma2]])
# 验证
gaussian = GaussianEF(mu=2, sigma2=4)
print(f"自然参数: {gaussian.eta}")
print(f"对数配分函数: {gaussian.log_partition():.4f}")
print(f"均值: {gaussian.mean()}")
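性质 \(E[T(X)] = \nabla_\eta A(\eta)\) 可以对正态分布的例子做数值验证。下面的示意代码为保持自包含,独立重新定义了 \(A(\eta)\),再用中心差分求梯度并与理论矩对比:

```python
import numpy as np

def log_partition(eta):
    """正态分布指数族形式的对数配分函数 A(eta)"""
    eta1, eta2 = eta
    return -eta1**2 / (4 * eta2) - 0.5 * np.log(-2 * eta2)

mu, sigma2 = 2.0, 4.0
eta = np.array([mu / sigma2, -0.5 / sigma2])  # 自然参数

# 数值梯度 dA/deta(中心差分)
eps = 1e-6
grad = np.array([
    (log_partition(eta + eps * e) - log_partition(eta - eps * e)) / (2 * eps)
    for e in np.eye(2)
])

# 理论:E[T(X)] = [E[x], E[x^2]] = [mu, sigma^2 + mu^2]
expected = np.array([mu, sigma2 + mu**2])
print("数值梯度 dA/deta:", grad)
print("E[T(X)]:        ", expected)
```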
2.5 随机过程¶
马尔可夫链¶
定义:

- 状态序列 \(X_1, X_2, ..., X_n\)
- 马尔可夫性质:\(P(X_{t+1} | X_t, X_{t-1}, ..., X_1) = P(X_{t+1} | X_t)\)

转移矩阵:\(P_{ij} = P(X_{t+1} = j | X_t = i)\)

平稳分布:满足 \(\pi = \pi P\);若链还是可逆的,则满足更强的细致平衡条件 \(\pi_i P_{ij} = \pi_j P_{ji}\)(细致平衡是平稳性的充分而非必要条件)

应用:

- PageRank算法
- MCMC采样
- 强化学习中的MDP
import numpy as np
class MarkovChain:
def __init__(self, transition_matrix):
self.P = transition_matrix
self.n_states = transition_matrix.shape[0]
def simulate(self, initial_state, n_steps):
"""模拟马尔可夫链"""
states = [initial_state]
current = initial_state
for _ in range(n_steps - 1):
current = np.random.choice(self.n_states, p=self.P[current])
states.append(current)
return states
def stationary_distribution(self):
"""计算平稳分布"""
# 解方程: pi P = pi, sum(pi) = 1
A = np.vstack([self.P.T - np.eye(self.n_states), np.ones(self.n_states)])
b = np.zeros(self.n_states + 1)
b[-1] = 1 # [-1]负索引取最后一个元素
pi = np.linalg.lstsq(A, b, rcond=None)[0]
return pi
def n_step_transition(self, n):
"""n步转移矩阵"""
return np.linalg.matrix_power(self.P, n)
# 示例:天气模型
# 状态: 0=晴天, 1=多云, 2=雨天
P = np.array([
[0.7, 0.2, 0.1], # 晴天 -> 晴天/多云/雨天
[0.3, 0.4, 0.3], # 多云 -> 晴天/多云/雨天
[0.2, 0.3, 0.5] # 雨天 -> 晴天/多云/雨天
])
mc = MarkovChain(P)
pi = mc.stationary_distribution()
print(f"平稳分布: 晴天={pi[0]:.3f}, 多云={pi[1]:.3f}, 雨天={pi[2]:.3f}")
# 长期预测
P_30 = mc.n_step_transition(30)
print(f"\n30天后(无论从什么天气开始):")
print(f"晴天概率: {P_30[0, 0]:.3f}")
print(f"多云概率: {P_30[0, 1]:.3f}")
print(f"雨天概率: {P_30[0, 2]:.3f}")
高斯过程¶
定义:

- 随机变量的集合,任意有限个变量的联合分布都是多元高斯分布
- 完全由均值函数 \(m(x)\) 和协方差函数 \(k(x, x')\) 决定

协方差函数(核函数):

- RBF核:\(k(x, x') = \sigma^2 \exp\left(-\frac{\|x - x'\|^2}{2l^2}\right)\)
- 线性核:\(k(x, x') = x^T x'\)
- 周期核:\(k(x, x') = \sigma^2 \exp\left(-\frac{2\sin^2(\pi|x - x'|/p)}{l^2}\right)\)

应用:

- 贝叶斯优化
- 回归和分类
- 时间序列预测
import numpy as np
from scipy.spatial.distance import cdist
class GaussianProcess:
"""
高斯过程回归
"""
def __init__(self, kernel='rbf', noise=1e-5, **kernel_params):
self.kernel_name = kernel
self.noise = noise
self.kernel_params = kernel_params
self.X_train = None
self.y_train = None
self.K = None
def kernel(self, X1, X2):
"""计算核矩阵"""
if self.kernel_name == 'rbf':
length_scale = self.kernel_params.get('length_scale', 1.0)
sigma_f = self.kernel_params.get('sigma_f', 1.0)
dist = cdist(X1, X2, 'sqeuclidean')
return sigma_f**2 * np.exp(-0.5 * dist / length_scale**2)
elif self.kernel_name == 'linear':
return X1 @ X2.T
else:
raise ValueError(f"Unknown kernel: {self.kernel_name}")
def fit(self, X, y):
"""训练高斯过程"""
self.X_train = X
self.y_train = y
# 计算核矩阵
K = self.kernel(X, X)
self.K = K + self.noise * np.eye(len(X))
self.L = np.linalg.cholesky(self.K)
self.alpha = np.linalg.solve(self.L.T, np.linalg.solve(self.L, y))
return self
def predict(self, X_test, return_std=True):
"""预测"""
# 计算测试点与训练点的核
K_s = self.kernel(self.X_train, X_test)
# 均值
mu = K_s.T @ self.alpha
if return_std:
# 方差
K_ss = self.kernel(X_test, X_test)
v = np.linalg.solve(self.L, K_s)
var = np.diag(K_ss) - np.sum(v**2, axis=0)
std = np.sqrt(np.maximum(var, 0))
return mu, std
return mu
# 示例:用GP拟合函数
np.random.seed(42)
# 真实函数
def f(x):
return np.sin(x) + 0.1 * x**2
# 训练数据
X_train = np.array([-3, -2, -1, 0, 1, 2, 3]).reshape(-1, 1)
y_train = f(X_train).ravel() + 0.1 * np.random.randn(len(X_train))
# 测试点
X_test = np.linspace(-4, 4, 100).reshape(-1, 1)
# 拟合GP
gp = GaussianProcess(kernel='rbf', length_scale=1.0, sigma_f=1.0)
gp.fit(X_train, y_train)
mu, std = gp.predict(X_test)
# 可视化
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
plt.plot(X_test, f(X_test), 'r--', label='True function')
plt.plot(X_test, mu, 'b-', label='GP mean')
plt.fill_between(X_test.ravel(), mu - 2*std, mu + 2*std, alpha=0.2, label='95% confidence')
plt.scatter(X_train, y_train, c='red', s=100, zorder=5, label='Training data')
plt.legend()
plt.title('Gaussian Process Regression')
plt.show()
2.6 统计推断¶
最大似然估计(MLE)¶
原理:选择使观测数据出现概率最大的参数

$$\hat{\theta}_{MLE} = \arg\max_{\theta} \prod_{i=1}^{n} p(x_i \mid \theta) = \arg\max_{\theta} \sum_{i=1}^{n} \log p(x_i \mid \theta)$$

性质:

- 一致性:\(\hat{\theta}_{MLE} \xrightarrow{p} \theta_0\)(当\(n \to \infty\))
- 渐近正态性:\(\sqrt{n}(\hat{\theta}_{MLE} - \theta_0) \xrightarrow{d} \mathcal{N}(0, I^{-1}(\theta_0))\)
- 渐近有效性:达到Cramér-Rao下界
from scipy.optimize import minimize
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt  # 下方的似然可视化需要
def maximum_likelihood_estimation(data, distribution='normal'):
"""
最大似然估计
data: 观测数据
distribution: 分布类型
"""
if distribution == 'normal':
# 正态分布的MLE有解析解
mu_mle = np.mean(data)
sigma2_mle = np.var(data, ddof=0) # 注意:MLE用n而不是n-1
return {'mu': mu_mle, 'sigma2': sigma2_mle}
elif distribution == 'exponential':
# 指数分布: lambda_MLE = 1/mean
lambda_mle = 1 / np.mean(data)
return {'lambda': lambda_mle}
else:
raise ValueError(f"Unknown distribution: {distribution}")
# 示例:估计正态分布参数
true_mu, true_sigma = 5, 2
data = np.random.normal(true_mu, true_sigma, size=1000)
params = maximum_likelihood_estimation(data, 'normal')
print(f"真实参数: mu={true_mu}, sigma^2={true_sigma**2}")
print(f"MLE估计: mu={params['mu']:.4f}, sigma^2={params['sigma2']:.4f}")
# 可视化似然函数
mu_range = np.linspace(3, 7, 100)
sigma_range = np.linspace(1, 4, 100)
Mu, Sigma = np.meshgrid(mu_range, sigma_range)
log_likelihood = np.zeros_like(Mu)
for i in range(len(sigma_range)):
for j in range(len(mu_range)):
log_likelihood[i, j] = np.sum(norm.logpdf(data, Mu[i, j], Sigma[i, j]))
plt.figure(figsize=(10, 6))
contour = plt.contour(Mu, Sigma, log_likelihood, levels=20)
plt.clabel(contour, inline=True, fontsize=8)
plt.plot(params['mu'], np.sqrt(params['sigma2']), 'r*', markersize=15, label='MLE')
plt.xlabel('mu')
plt.ylabel('sigma')
plt.title('Log-Likelihood Contours')
plt.legend()
plt.show()
Fisher信息矩阵与Cramér-Rao下界¶
Fisher信息矩阵:

$$I(\theta) = E\left[\left(\frac{\partial \log p(x \mid \theta)}{\partial \theta}\right)^2\right] = -E\left[\frac{\partial^2 \log p(x \mid \theta)}{\partial \theta^2}\right]$$

Cramér-Rao下界:

任何无偏估计量的方差满足:

$$Var(\hat{\theta}) \geq I^{-1}(\theta)$$

(这里\(I(\theta)\)指整个样本的信息量;对\(n\)个独立样本,它是单样本信息量的\(n\)倍)

意义:MLE在渐近意义下达到最优
def compute_fisher_information_normal(mu, sigma2, n):
"""
计算正态分布的Fisher信息矩阵
对于正态分布 N(mu, sigma^2):
I(mu) = n / sigma^2
I(sigma^2) = n / (2 * sigma^4)
"""
I_mu = n / sigma2
I_sigma2 = n / (2 * sigma2**2)
return I_mu, I_sigma2
# 验证Cramér-Rao下界
n = 100
mu_true, sigma2_true = 5, 4
# 多次实验,计算样本均值的方差
n_experiments = 1000
mu_estimates = []
for _ in range(n_experiments):
data = np.random.normal(mu_true, np.sqrt(sigma2_true), n)
mu_estimates.append(np.mean(data))
empirical_variance = np.var(mu_estimates, ddof=1)
# 理论下界
I_mu, _ = compute_fisher_information_normal(mu_true, sigma2_true, n)
cramer_rao_bound = 1 / I_mu
print(f"样本均值的实证方差: {empirical_variance:.6f}")
print(f"Cramér-Rao下界: {cramer_rao_bound:.6f}")
print(f"MLE达到了下界: {np.isclose(empirical_variance, cramer_rao_bound, rtol=0.1)}")
最大后验估计(MAP)¶
原理:结合先验知识和观测数据

$$\hat{\theta}_{MAP} = \arg\max_{\theta} p(\theta \mid \mathcal{D}) = \arg\max_{\theta} \left[\log p(\mathcal{D} \mid \theta) + \log p(\theta)\right]$$

与MLE的关系:

- 当先验是均匀分布时,MAP = MLE
- MAP可以看作正则化的MLE
def map_estimate_normal(data, prior_mu=0, prior_sigma2=1, prior_weight=1):
"""
正态分布均值的MAP估计
先验: mu ~ N(prior_mu, prior_sigma2)
似然: data ~ N(mu, sigma^2)(假设sigma^2已知)
"""
n = len(data)
sample_mean = np.mean(data)
# 假设sigma^2 = 1(简化)
sigma2 = 1
# MAP估计是高斯先验和高斯似然的加权平均
# 后验: mu|data ~ N(mu_n, sigma_n^2)
# 其中:
# 1/sigma_n^2 = n/sigma^2 + 1/prior_sigma2
# mu_n = sigma_n^2 * (n*sample_mean/sigma^2 + prior_mu/prior_sigma2)
posterior_precision = n / sigma2 + 1 / prior_sigma2
posterior_variance = 1 / posterior_precision
posterior_mean = posterior_variance * (n * sample_mean / sigma2 + prior_mu / prior_sigma2)
return posterior_mean, posterior_variance
# 示例:小样本情况下的MAP优势
np.random.seed(42)
true_mu = 5
n_samples = 5 # 小样本
data = np.random.normal(true_mu, 1, n_samples)
# MLE
mle = np.mean(data)
# MAP(强先验:认为mu在0附近)
map_est, map_var = map_estimate_normal(data, prior_mu=0, prior_sigma2=1)
# MAP(弱先验)
map_est_weak, _ = map_estimate_normal(data, prior_mu=0, prior_sigma2=10)
print(f"真实值: {true_mu}")
print(f"MLE: {mle:.4f}")
print(f"MAP(强先验): {map_est:.4f}")
print(f"MAP(弱先验): {map_est_weak:.4f}")
print(f"\n强先验将估计向0拉近,起到正则化作用")
三、微积分与优化¶
3.1 微积分基础¶
梯度、Hessian与泰勒展开¶
一阶泰勒展开:

$$f(\mathbf{x} + \Delta\mathbf{x}) \approx f(\mathbf{x}) + \nabla f(\mathbf{x})^T \Delta\mathbf{x}$$

二阶泰勒展开:

$$f(\mathbf{x} + \Delta\mathbf{x}) \approx f(\mathbf{x}) + \nabla f(\mathbf{x})^T \Delta\mathbf{x} + \frac{1}{2} \Delta\mathbf{x}^T H(\mathbf{x}) \Delta\mathbf{x}$$

几何意义:

- 梯度:函数增长最快的方向
- Hessian:函数的局部曲率
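两种展开的近似精度可以直接数值比较(示意代码,函数、展开点与步长均为随意选取的示例):

```python
import numpy as np

def f(x):
    return x[0]**2 + 2*x[1]**2 + np.sin(x[0] * x[1])

def grad_f(x):
    return np.array([2*x[0] + x[1]*np.cos(x[0]*x[1]),
                     4*x[1] + x[0]*np.cos(x[0]*x[1])])

def hess_f(x):
    s, c = np.sin(x[0]*x[1]), np.cos(x[0]*x[1])
    return np.array([[2 - x[1]**2 * s, c - x[0]*x[1]*s],
                     [c - x[0]*x[1]*s, 4 - x[0]**2 * s]])

x0 = np.array([0.5, 0.5])
dx = np.array([0.1, -0.05])

true_val = f(x0 + dx)
first_order = f(x0) + grad_f(x0) @ dx
second_order = first_order + 0.5 * dx @ hess_f(x0) @ dx

print(f"真实值:   {true_val:.6f}")
print(f"一阶近似: {first_order:.6f}(误差 {abs(true_val - first_order):.2e})")
print(f"二阶近似: {second_order:.6f}(误差 {abs(true_val - second_order):.2e})")
```

二阶近似包含了曲率信息,步长越小时相对一阶近似的优势越明显(误差从 \(O(\|\Delta x\|^2)\) 降到 \(O(\|\Delta x\|^3)\))。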
import numpy as np
import matplotlib.pyplot as plt
# 可视化梯度
def f(x, y):
return x**2 + 2*y**2
def grad_f(x, y):
return np.array([2*x, 4*y])
# 绘制函数和梯度场
x = np.linspace(-3, 3, 20)
y = np.linspace(-3, 3, 20)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)
U, V = grad_f(X, Y)
plt.figure(figsize=(10, 8))
contour = plt.contour(X, Y, Z, levels=15)
plt.clabel(contour, inline=True, fontsize=8)
plt.quiver(X, Y, U, V, alpha=0.6)
plt.xlabel('x')
plt.ylabel('y')
plt.title('Function Contours and Gradient Field')
plt.colorbar(contour)
plt.show()
# 梯度指向函数增长最快的方向,垂直于等高线
链式法则与反向传播¶
多变量链式法则:

若 \(z = f(y_1, y_2, ..., y_m)\),\(y_i = g_i(x_1, x_2, ..., x_n)\),则:

$$\frac{\partial z}{\partial x_j} = \sum_{i=1}^{m} \frac{\partial z}{\partial y_i} \frac{\partial y_i}{\partial x_j}$$

反向传播的本质:

- 动态规划高效计算梯度
- 从输出层向输入层传播误差
import numpy as np
class ComputationalGraph:
"""
计算图和反向传播的简单实现
"""
def __init__(self):
self.gradients = {}
def forward(self, x, y, z):
"""
计算: f = (x + y) * z
"""
self.x = x
self.y = y
self.z = z
self.q = x + y # 中间变量
self.f = self.q * z
return self.f
def backward(self, df=1):
"""
反向传播计算梯度
f = q * z
q = x + y
df/df = 1
df/dq = z
df/dz = q
df/dx = df/dq * dq/dx = z * 1 = z
df/dy = df/dq * dq/dy = z * 1 = z
"""
# 从输出开始
df_dq = self.z
df_dz = self.q
# 传播到输入
df_dx = df_dq * 1 # dq/dx = 1
df_dy = df_dq * 1 # dq/dy = 1
self.gradients = {
'x': df_dx,
'y': df_dy,
'z': df_dz
}
return self.gradients
# 示例
graph = ComputationalGraph()
f = graph.forward(x=2, y=3, z=4)
grads = graph.backward() # 反向传播计算梯度
print(f"f = (2 + 3) * 4 = {f}")
print(f"df/dx = {grads['x']}")
print(f"df/dy = {grads['y']}")
print(f"df/dz = {grads['z']}")
# 数值验证
eps = 1e-5
f_x_plus = graph.forward(x=2+eps, y=3, z=4)
grad_x_numerical = (f_x_plus - f) / eps
print(f"\n数值验证 df/dx: {grad_x_numerical}")
3.2 凸优化理论¶
凸集与凸函数¶
凸集定义:集合\(C\)是凸的,如果对于任意\(\mathbf{x}, \mathbf{y} \in C\)和\(\theta \in [0, 1]\):

$$\theta \mathbf{x} + (1 - \theta) \mathbf{y} \in C$$

凸函数定义:函数\(f\)是凸的,如果:

$$f(\theta \mathbf{x} + (1 - \theta) \mathbf{y}) \leq \theta f(\mathbf{x}) + (1 - \theta) f(\mathbf{y})$$

等价条件(可微函数):

- 一阶条件:\(f(\mathbf{y}) \geq f(\mathbf{x}) + \nabla f(\mathbf{x})^T (\mathbf{y} - \mathbf{x})\)
- 二阶条件:\(H(\mathbf{x}) \succeq 0\)(Hessian半正定)

为什么凸优化重要?

- 局部最优 = 全局最优
- 有高效的算法
- 很多ML问题是凸的(或可以近似为凸的)
import numpy as np
import matplotlib.pyplot as plt
# 可视化凸函数和非凸函数
x = np.linspace(-3, 3, 100)
# 凸函数
f_convex = x**2
# 非凸函数
f_nonconvex = x**4 - 2*x**2 + 0.5*x
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.plot(x, f_convex, 'b-', linewidth=2)
plt.title('Convex Function: $f(x) = x^2$')
plt.xlabel('x')
plt.ylabel('f(x)')
plt.grid(True)
# 画割线说明凸性
x1, x2 = -1.5, 1.5
plt.plot([x1, x2], [x1**2, x2**2], 'r--', label='Chord')
plt.legend()
plt.subplot(1, 2, 2)
plt.plot(x, f_nonconvex, 'b-', linewidth=2)
plt.title('Non-Convex Function')
plt.xlabel('x')
plt.ylabel('f(x)')
plt.grid(True)
plt.tight_layout()
plt.show()
# 验证凸性:检查Hessian是否半正定
def check_convexity_1d(f_func, x_range):
"""检查一元函数的凸性(通过二阶导数)"""
# 使用中心差分近似计算二阶导数
def second_derivative(f, x, eps=1e-4):  # eps过小会放大二阶差分的浮点舍入误差
"""计算二阶导数的中心差分近似"""
return (f(x + eps) - 2 * f(x) + f(x - eps)) / (eps ** 2)
second_derivatives = [second_derivative(f_func, x) for x in x_range]
if all(d >= 0 for d in second_derivatives): # all()所有元素为True才返回True
return True, min(second_derivatives)
else:
return False, min(second_derivatives)
is_convex, min_second_deriv = check_convexity_1d(lambda x: x**2, np.linspace(-2, 2, 10))
print(f"f(x)=x^2 是凸函数: {is_convex}, 最小二阶导数: {min_second_deriv:.4f}")
对偶理论¶
对于优化问题:

$$\min_{\mathbf{x}} f(\mathbf{x}) \quad \text{s.t.} \quad g_i(\mathbf{x}) \leq 0, \quad h_j(\mathbf{x}) = 0$$

拉格朗日函数:

$$L(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\nu}) = f(\mathbf{x}) + \sum_i \lambda_i g_i(\mathbf{x}) + \sum_j \nu_j h_j(\mathbf{x})$$

对偶函数:

$$g(\boldsymbol{\lambda}, \boldsymbol{\nu}) = \inf_{\mathbf{x}} L(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\nu})$$

对偶问题:

$$\max_{\boldsymbol{\lambda} \geq 0, \boldsymbol{\nu}} g(\boldsymbol{\lambda}, \boldsymbol{\nu})$$

弱对偶性:\(d^* \leq p^*\)(对偶最优值 ≤ 原始最优值)

强对偶性:\(d^* = p^*\)(在凸优化且满足约束规格时成立)
from scipy.optimize import minimize
import numpy as np
def solve_primal_dual():
"""
求解原始问题和对偶问题
原始问题:
min x^2 + y^2
s.t. x + y >= 1
"""
# 原始问题
def objective(x):
return x[0]**2 + x[1]**2
def constraint(x):
return x[0] + x[1] - 1 # >= 0
# 求解原始问题
result_primal = minimize(
objective,
x0=[0, 0],
method='SLSQP',
constraints={'type': 'ineq', 'fun': constraint}
)
p_star = result_primal.fun
x_star = result_primal.x
# 对偶问题: max_lambda min_{x,y} x^2 + y^2 + lambda*(1 - x - y)
# 对于固定的lambda,最优x = y = lambda/2
# 代入得: g(lambda) = lambda - lambda^2/2
# 对偶问题: max_lambda lambda - lambda^2/2, s.t. lambda >= 0
def dual_objective(lambda_):
return -(lambda_ - lambda_**2 / 2) # 取负号,把最大化问题转成minimize可解的最小化问题
result_dual = minimize(
dual_objective,
x0=[0.5],
method='SLSQP',
bounds=[(0, None)]
)
d_star = -result_dual.fun
lambda_star = result_dual.x[0]
print("原始问题:")
print(f" 最优值 p* = {p_star:.6f}")
print(f" 最优解 x* = [{x_star[0]:.6f}, {x_star[1]:.6f}]")
print("\n对偶问题:")
print(f" 最优值 d* = {d_star:.6f}")
print(f" 最优解 lambda* = {lambda_star:.6f}")
print(f"\n对偶间隙: p* - d* = {p_star - d_star:.10f}")
print(f"强对偶性成立: {np.isclose(p_star, d_star)}")
solve_primal_dual()
KKT条件¶
KKT条件(Karush-Kuhn-Tucker):
对于凸优化问题,如果强对偶性成立,则\(\mathbf{x}^*\)是最优解当且仅当存在\(\boldsymbol{\lambda}^*, \boldsymbol{\nu}^*\)满足:
- 稳定性:\(\nabla f(\mathbf{x}^*) + \sum_i \lambda_i^* \nabla g_i(\mathbf{x}^*) + \sum_j \nu_j^* \nabla h_j(\mathbf{x}^*) = 0\)
- 原始可行性:\(g_i(\mathbf{x}^*) \leq 0\),\(h_j(\mathbf{x}^*) = 0\)
- 对偶可行性:\(\lambda_i^* \geq 0\)
- 互补松弛性:\(\lambda_i^* g_i(\mathbf{x}^*) = 0\)
意义:KKT条件将约束优化问题转化为方程组求解
import numpy as np
from scipy.optimize import minimize
def verify_kkt():
"""
验证KKT条件
问题: min (x-2)^2 + (y-1)^2
s.t. x + y <= 2
x >= 0, y >= 0
"""
# 求解
def objective(x):
return (x[0] - 2)**2 + (x[1] - 1)**2
def grad_objective(x):
return np.array([2*(x[0]-2), 2*(x[1]-1)])
constraints = [
{'type': 'ineq', 'fun': lambda x: 2 - x[0] - x[1]}, # x + y <= 2
{'type': 'ineq', 'fun': lambda x: x[0]}, # x >= 0
{'type': 'ineq', 'fun': lambda x: x[1]} # y >= 0
]
result = minimize(objective, x0=[0, 0], method='SLSQP', constraints=constraints)
x_star = result.x
print(f"最优解: x* = [{x_star[0]:.6f}, {x_star[1]:.6f}]")
print(f"约束 x + y <= 2: {x_star[0] + x_star[1]:.6f}")
# 在解处,第一个约束是紧的(active)
# 根据KKT条件,存在 lambda >= 0 使得:
# grad_f(x*) + lambda * grad_g(x*) = 0
# 其中 g(x) = x + y - 2
# grad_f = [2(x-2), 2(y-1)]
# grad_g = [1, 1]
# 2(x-2) + lambda = 0 => lambda = 2(2-x)
# 2(y-1) + lambda = 0 => lambda = 2(1-y)
lambda_kkt = 2 * (2 - x_star[0])
print(f"\nKKT乘子: lambda = {lambda_kkt:.6f}")
# 验证稳定性条件
grad_f = grad_objective(x_star)
grad_g = np.array([1, 1])
stability = grad_f + lambda_kkt * grad_g
print(f"\nKKT稳定性条件:")
print(f" grad_f(x*) = [{grad_f[0]:.6f}, {grad_f[1]:.6f}]")
print(f" grad_f(x*) + lambda * grad_g(x*) = [{stability[0]:.10f}, {stability[1]:.10f}]")
print(f" 条件满足: {np.allclose(stability, 0)}")
# 验证互补松弛性
complementarity = lambda_kkt * (x_star[0] + x_star[1] - 2)
print(f"\n互补松弛性: lambda * g(x*) = {complementarity:.10f}")
verify_kkt()
3.3 高级优化算法¶
牛顿法与拟牛顿法¶
牛顿法:

利用二阶信息(Hessian)加速收敛:

$$\mathbf{x}_{t+1} = \mathbf{x}_t - [\nabla^2 f(\mathbf{x}_t)]^{-1} \nabla f(\mathbf{x}_t)$$

优点: - 二次收敛(在解附近) - 迭代次数少

缺点: - 需要计算和存储Hessian矩阵 - 每步需要求解线性方程组

拟牛顿法(BFGS):

用近似矩阵\(B_t\)代替Hessian,满足割线条件:

$$B_{t+1} \mathbf{s}_t = \mathbf{y}_t$$

其中\(\mathbf{s}_t = \mathbf{x}_{t+1} - \mathbf{x}_t\),\(\mathbf{y}_t = \nabla f(\mathbf{x}_{t+1}) - \nabla f(\mathbf{x}_t)\)
import numpy as np
from scipy.optimize import minimize
def rosenbrock(x):
"""Rosenbrock函数(优化测试函数)"""
return sum(100.0*(x[1:]-x[:-1]**2.0)**2.0 + (1-x[:-1])**2.0) # 切片操作取子序列
def rosenbrock_grad(x):
"""Rosenbrock函数的梯度"""
xm = x[1:-1]
xm_m1 = x[:-2]
xm_p1 = x[2:]
der = np.zeros_like(x)
der[1:-1] = 200*(xm-xm_m1**2) - 400*(xm_p1 - xm**2)*xm - 2*(1-xm)
der[0] = -400*x[0]*(x[1]-x[0]**2) - 2*(1-x[0])
der[-1] = 200*(x[-1]-x[-2]**2)
return der
# 比较不同优化算法
x0 = np.array([-1.0, 1.0])
methods = ['Nelder-Mead', 'BFGS', 'Newton-CG']
results = {}
for method in methods:
if method == 'Newton-CG':
# 牛顿法需要Hessian或Hessian向量积
result = minimize(rosenbrock, x0, method=method,
                        jac=rosenbrock_grad,  # 未提供Hessian时,Newton-CG用梯度的有限差分近似Hessian向量积
options={'maxiter': 100})
else:
result = minimize(rosenbrock, x0, method=method,
jac=rosenbrock_grad if method != 'Nelder-Mead' else None,
options={'maxiter': 100})
results[method] = {
'success': result.success,
'iterations': result.nit,
'function_evals': result.nfev,
'final_value': result.fun,
'solution': result.x
}
print("优化算法比较:\n")
print(f"{'方法':<15} {'成功':<8} {'迭代':<8} {'函数评估':<12} {'最终值':<15}")
print("-" * 60)
for method, res in results.items():
print(f"{method:<15} {str(res['success']):<8} {res['iterations']:<8} "
f"{res['function_evals']:<12} {res['final_value']:<15.6e}")
随机优化理论¶
SGD的收敛性:

对于凸函数,步长\(\eta_t = \frac{\eta_0}{\sqrt{t}}\)时:

$$\mathbb{E}[f(\bar{\mathbf{x}}_T)] - f^* = O\left(\frac{1}{\sqrt{T}}\right)$$

对于强凸函数,步长\(\eta_t = \frac{1}{\mu t}\)时:

$$\mathbb{E}[f(\bar{\mathbf{x}}_T)] - f^* = O\left(\frac{1}{\mu T}\right)$$

方差缩减方法: - SVRG(Stochastic Variance Reduced Gradient) - SAGA - 收敛速度:凸情形\(O(\frac{1}{T})\)(与全梯度方法相同),强凸情形可达线性收敛
import numpy as np
class StochasticOptimizer:
"""
随机优化算法的统一框架
"""
def __init__(self, lr_schedule='constant'):
self.lr_schedule = lr_schedule
def learning_rate(self, t, eta0=0.01):
"""学习率调度"""
if self.lr_schedule == 'constant':
return eta0
elif self.lr_schedule == 'decay':
return eta0 / (1 + 0.1 * t)
elif self.lr_schedule == 'sqrt_decay':
return eta0 / np.sqrt(1 + t)
elif self.lr_schedule == 'inv_t':
return eta0 / (1 + t)
def sgd(self, grad_func, x0, n_iterations=1000, batch_size=1, eta0=0.01):
"""随机梯度下降"""
x = x0.copy()
trajectory = [x.copy()]
for t in range(n_iterations):
eta = self.learning_rate(t, eta0)
# 随机采样梯度(这里简化处理)
grad = grad_func(x, batch_size)
x = x - eta * grad
trajectory.append(x.copy())
return x, trajectory
def svrg(self, grad_func, full_grad_func, x0, n_epochs=10, n_inner=100, eta=0.01):
"""SVRG: 方差缩减随机梯度下降"""
x = x0.copy()
for epoch in range(n_epochs):
# 计算全梯度
mu = full_grad_func(x)
x_snapshot = x.copy()
for t in range(n_inner):
# 随机梯度
grad_current = grad_func(x, 1)
grad_snapshot = grad_func(x_snapshot, 1)
# 方差缩减梯度
v = grad_current - grad_snapshot + mu
x = x - eta * v
return x
# 示例:线性回归的SGD
np.random.seed(42)
n_samples, n_features = 1000, 10
X = np.random.randn(n_samples, n_features)
true_w = np.random.randn(n_features)
y = X @ true_w + 0.1 * np.random.randn(n_samples)
def grad_func(w, batch_size):
"""随机梯度"""
indices = np.random.choice(n_samples, batch_size, replace=False)
X_batch = X[indices]
y_batch = y[indices]
return 2 * X_batch.T @ (X_batch @ w - y_batch) / batch_size
def full_grad_func(w):
"""全梯度"""
return 2 * X.T @ (X @ w - y) / n_samples
optimizer = StochasticOptimizer(lr_schedule='sqrt_decay')
w_sgd, _ = optimizer.sgd(grad_func, np.zeros(n_features),
n_iterations=1000, batch_size=10, eta0=0.01)
print(f"真实参数: {true_w[:3]}")
print(f"SGD估计: {w_sgd[:3]}")
print(f"误差: {np.linalg.norm(w_sgd - true_w):.6f}")
自适应学习率算法¶
AdaGrad:

$$G_t = G_{t-1} + g_t^2, \quad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t} + \epsilon}\, g_t$$

特点:对稀疏梯度给予更大学习率(累积平方和小的维度步长大),但学习率单调衰减

RMSprop:

$$v_t = \beta v_{t-1} + (1-\beta)\, g_t^2, \quad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{v_t} + \epsilon}\, g_t$$

Adam(Adaptive Moment Estimation):

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \quad v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2$$

$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t}, \quad \theta_{t+1} = \theta_t - \frac{\eta\, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$
import numpy as np
class AdaptiveOptimizer:
"""
自适应学习率优化算法
"""
def __init__(self, params):
self.params = params
self.t = 0
def step(self, grads):
raise NotImplementedError
class Adam(AdaptiveOptimizer):
"""Adam优化器"""
def __init__(self, params, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
super().__init__(params)
self.lr = lr
self.beta1 = beta1
self.beta2 = beta2
self.eps = eps
self.m = np.zeros_like(params)
self.v = np.zeros_like(params)
def step(self, grads):
self.t += 1
# 更新一阶矩和二阶矩
self.m = self.beta1 * self.m + (1 - self.beta1) * grads
self.v = self.beta2 * self.v + (1 - self.beta2) * (grads ** 2)
# 偏差修正
m_hat = self.m / (1 - self.beta1 ** self.t)
v_hat = self.v / (1 - self.beta2 ** self.t)
# 更新参数
self.params = self.params - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
return self.params
class AdaGrad(AdaptiveOptimizer):
"""AdaGrad优化器"""
def __init__(self, params, lr=0.01, eps=1e-8):
super().__init__(params)
self.lr = lr
self.eps = eps
self.G = np.zeros_like(params)
def step(self, grads):
self.G += grads ** 2
self.params = self.params - self.lr * grads / (np.sqrt(self.G) + self.eps)
return self.params
# 比较不同优化器
np.random.seed(42)
# 病态条件的二次函数
Q = np.diag([1000, 100, 10, 1, 0.1])
b = np.ones(5)
def f(x):
return 0.5 * x.T @ Q @ x - b.T @ x
def grad(x):
return Q @ x - b
x0 = np.ones(5) * 10
optimizers = {
'SGD': (lambda x: x - 0.001 * grad(x)),
'AdaGrad': AdaGrad(x0.copy(), lr=0.1),
'Adam': Adam(x0.copy(), lr=0.01)
}
results = {}
for name, opt in optimizers.items():
if callable(opt):
# 简单SGD
x = x0.copy()
history = [f(x)]
for _ in range(1000):
x = opt(x)
history.append(f(x))
else:
# 自适应方法
x = opt.params.copy()
history = [f(x)]
for _ in range(1000):
g = grad(x)
x = opt.step(g)
history.append(f(x))
results[name] = history
# 绘制收敛曲线
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
for name, history in results.items():
plt.semilogy(history, label=name)
plt.xlabel('Iteration')
plt.ylabel('Function Value (log scale)')
plt.title('Optimizer Comparison on Ill-Conditioned Problem')
plt.legend()
plt.grid(True)
plt.show()
print(f"最优值: {f(np.linalg.solve(Q, b)):.10f}")
for name, history in results.items():
print(f"{name}最终值: {history[-1]:.10f}")
3.4 约束优化¶
投影梯度法¶
问题:\(\min_{\mathbf{x} \in C} f(\mathbf{x})\)

算法:

$$\mathbf{x}_{t+1} = \Pi_C\big(\mathbf{x}_t - \eta_t \nabla f(\mathbf{x}_t)\big)$$

其中\(\Pi_C(\mathbf{x}) = \arg\min_{\mathbf{y} \in C} \|\mathbf{y} - \mathbf{x}\|_2\)是\(\mathbf{x}\)在集合\(C\)上的投影

常见投影: - 非负约束:\(\Pi_{\mathbb{R}^n_+}(\mathbf{x}) = \max(\mathbf{x}, 0)\)(逐元素) - L2球:\(\Pi_{\|\cdot\|_2 \leq r}(\mathbf{x}) = r\,\frac{\mathbf{x}}{\max(\|\mathbf{x}\|_2, r)}\) - L1球:需要专门算法(如基于排序的单纯形投影)
import numpy as np
def project_l2_ball(x, radius=1.0):
"""投影到L2球"""
norm = np.linalg.norm(x)
if norm <= radius:
return x
return radius * x / norm
def project_simplex(v, z=1.0):
"""
投影到概率单纯形: {x | x >= 0, sum(x) = z}
使用排序算法,O(n log n)
"""
n = len(v)
u = np.sort(v)[::-1]
cssv = np.cumsum(u) - z
ind = np.arange(n) + 1
cond = u - cssv / ind > 0
rho = ind[cond][-1]
theta = cssv[cond][-1] / rho
return np.maximum(v - theta, 0)
def projected_gradient_descent(grad_func, x0, projection_func,
n_iterations=1000, eta=0.01):
"""投影梯度下降"""
x = x0.copy()
trajectory = []
for t in range(n_iterations):
grad = grad_func(x)
x = x - eta * grad
x = projection_func(x)
trajectory.append(x.copy())
return x, trajectory
# 示例:在概率单纯形上优化
np.random.seed(42)
# 目标: 最小化 ||x - target||^2,约束 x >= 0, sum(x) = 1
target = np.array([0.1, 0.3, 0.4, 0.2])
def grad(x):
return 2 * (x - target)
x0 = np.ones(4) / 4 # 均匀分布作为初始点
x_opt, traj = projected_gradient_descent(
grad, x0, lambda x: project_simplex(x, z=1.0),
n_iterations=500, eta=0.1
)
print(f"目标分布: {target}")
print(f"优化结果: {x_opt}")
print(f"约束检查: sum = {x_opt.sum():.6f}, min = {x_opt.min():.6f}")
print(f"误差: {np.linalg.norm(x_opt - target):.6f}")
近端梯度法(Proximal Gradient)¶
问题:\(\min_\mathbf{x} f(\mathbf{x}) + g(\mathbf{x})\)

其中\(f\)是光滑的,\(g\)是凸的(可能非光滑)

算法:

$$\mathbf{x}_{t+1} = \text{prox}_{\eta g}\big(\mathbf{x}_t - \eta \nabla f(\mathbf{x}_t)\big)$$

近端算子:

$$\text{prox}_{\lambda g}(\mathbf{v}) = \arg\min_{\mathbf{x}} \left( g(\mathbf{x}) + \frac{1}{2\lambda} \|\mathbf{x} - \mathbf{v}\|_2^2 \right)$$

常见近端算子: - L1正则(软阈值):\(\text{prox}_{\lambda \|\cdot\|_1}(x) = \text{sign}(x) \max(|x| - \lambda, 0)\)(逐元素) - L2平方正则:\(\text{prox}_{\lambda \|\cdot\|_2^2/2}(x) = \frac{x}{1+\lambda}\)
import numpy as np
def soft_threshold(x, lambda_):
"""软阈值算子(L1近端算子)"""
return np.sign(x) * np.maximum(np.abs(x) - lambda_, 0)
def proximal_gradient_method(grad_f, prox_g, x0,
n_iterations=1000, eta=0.01):
"""
近端梯度法(Proximal Gradient Method)
解决: min f(x) + g(x)
"""
x = x0.copy()
for t in range(n_iterations):
# 梯度步
y = x - eta * grad_f(x)
# 近端步
x = prox_g(y, eta)
return x
# 示例:Lasso回归
# min ||Ax - b||^2 + lambda * ||x||_1
np.random.seed(42)
n, p = 100, 50
A = np.random.randn(n, p)
x_true = np.zeros(p)
x_true[:10] = np.random.randn(10) # 稀疏真实解
b = A @ x_true + 0.1 * np.random.randn(n)
lambda_lasso = 0.1
def grad_f(x):
"""f(x) = ||Ax - b||^2 的梯度"""
return 2 * A.T @ (A @ x - b)
def prox_g(x, eta):
"""g(x) = lambda * ||x||_1 的近端算子"""
return soft_threshold(x, eta * lambda_lasso)
x_lasso = proximal_gradient_method(grad_f, prox_g, np.zeros(p),
n_iterations=2000, eta=0.001)
print(f"Lasso解的非零元素: {np.sum(x_lasso != 0)}")
print(f"真实解的非零元素: {np.sum(x_true != 0)}")
print(f"重构误差: {np.linalg.norm(A @ x_lasso - b):.6f}")
# 与最小二乘比较
x_ls = np.linalg.lstsq(A, b, rcond=None)[0]
print(f"最小二乘解的非零元素: {np.sum(x_ls != 0)}")
print(f"最小二乘重构误差: {np.linalg.norm(A @ x_ls - b):.6f}")
四、信息论¶
4.1 信息熵¶
香农熵¶
定义:离散随机变量\(X\)的熵为

$$H(X) = -\sum_{x} P(x) \log P(x)$$
直观解释: - 熵是随机变量的"不确定性"或"信息量" - 均匀分布时熵最大 - 确定性事件熵为0
联合熵与条件熵: - \(H(X, Y) = -\sum_{x,y} P(x,y) \log P(x,y)\) - \(H(Y|X) = \sum_x P(x) H(Y|X=x) = H(X,Y) - H(X)\)
import numpy as np
from scipy.stats import entropy
def shannon_entropy(p):
"""计算香农熵"""
p = np.array(p)
p = p[p > 0] # 去除0概率(避免log(0))
return -np.sum(p * np.log2(p))
# 示例:比较不同分布的熵
uniform = np.ones(8) / 8
biased = np.array([0.5, 0.25, 0.125, 0.0625, 0.03125, 0.015625, 0.0078125, 0.0078125])
deterministic = np.array([1, 0, 0, 0, 0, 0, 0, 0])
print("熵的比较(单位:bits):")
print(f"均匀分布: {shannon_entropy(uniform):.4f} (最大: {np.log2(8):.4f})")
print(f"偏置分布: {shannon_entropy(biased):.4f}")
print(f"确定性分布: {shannon_entropy(deterministic):.4f}")
# 熵与编码长度的关系
print("\n熵与编码:")
print(f"均匀分布需要平均 {shannon_entropy(uniform):.2f} bits/符号")
print(f"偏置分布需要平均 {shannon_entropy(biased):.2f} bits/符号")
print("说明:利用分布的非均匀性可以压缩数据")
微分熵¶
定义:连续随机变量\(X\)(密度为\(p\))的微分熵

$$h(X) = -\int p(x) \log p(x) \, dx$$

注意: - 微分熵可以为负 - 微分熵不是离散熵的直接极限(以间隔\(\Delta\)量化时,离散熵约为\(h(X) - \log \Delta\),随\(\Delta \to 0\)发散) - 真正的信息量需要考虑量化精度
常见分布的微分熵:
| 分布 | 微分熵 |
|---|---|
| 均匀分布 \(U(a,b)\) | \(\log(b-a)\) |
| 正态分布 \(\mathcal{N}(\mu, \sigma^2)\) | \(\frac{1}{2}\log(2\pi e \sigma^2)\) |
| 指数分布 \(Exp(\lambda)\) | \(1 - \log\lambda\) |
import numpy as np
from scipy.stats import norm, uniform, expon
def differential_entropy_gaussian(sigma2):
"""正态分布的微分熵: 0.5 * log(2*pi*e*sigma^2)"""
return 0.5 * np.log(2 * np.pi * np.e * sigma2)
def estimate_differential_entropy(samples, bins='auto'):
"""用直方图估计微分熵"""
hist, bin_edges = np.histogram(samples, bins=bins, density=True)
bin_widths = np.diff(bin_edges)
# 只考虑非零概率的bin
mask = hist > 0
entropy = -np.sum(hist[mask] * np.log(hist[mask]) * bin_widths[mask])
return entropy
# 验证
sigma2 = 4
samples = np.random.normal(0, np.sqrt(sigma2), 100000)
estimated = estimate_differential_entropy(samples)
theoretical = differential_entropy_gaussian(sigma2)
print(f"正态分布 N(0, {sigma2}) 的微分熵:")
print(f"理论值: {theoretical:.4f}")
print(f"估计值: {estimated:.4f}")
4.2 互信息与KL散度¶
互信息¶
定义:

$$I(X;Y) = \sum_{x,y} P(x,y) \log \frac{P(x,y)}{P(x)\,P(y)} = H(X) - H(X|Y)$$
意义:知道\(Y\)后,\(X\)的不确定性减少了多少
性质: - \(I(X;Y) \geq 0\) - \(I(X;Y) = 0\) 当且仅当 \(X\) 和 \(Y\) 独立 - \(I(X;Y) = I(Y;X)\)
与相关性比较: - 互信息捕捉任意类型的依赖关系 - 相关性只捕捉线性关系
import numpy as np
from sklearn.feature_selection import mutual_info_regression
def compute_mutual_information_discrete(x, y):
"""
计算离散变量的互信息
"""
# 联合分布
joint = np.zeros((len(np.unique(x)), len(np.unique(y))))
for i, xi in enumerate(np.unique(x)):
for j, yj in enumerate(np.unique(y)):
joint[i, j] = np.mean((x == xi) & (y == yj))
# 边缘分布
px = joint.sum(axis=1)
py = joint.sum(axis=0)
# 互信息
mi = 0
for i in range(len(px)):
for j in range(len(py)):
if joint[i, j] > 0:
mi += joint[i, j] * np.log2(joint[i, j] / (px[i] * py[j]))
return mi
# 示例:不同类型的依赖关系
n = 1000
# 1. 线性关系
x1 = np.random.randn(n)
y1 = 2 * x1 + 0.1 * np.random.randn(n)
# 2. 非线性关系
x2 = np.random.randn(n)
y2 = x2**2 + 0.1 * np.random.randn(n)
# 3. 独立
x3 = np.random.randn(n)
y3 = np.random.randn(n)
from scipy.stats import pearsonr
def analyze_relationship(x, y, name):
corr, _ = pearsonr(x, y)
mi = mutual_info_regression(x.reshape(-1, 1), y)[0]
print(f"{name}:")
print(f" 相关系数: {corr:.4f}")
print(f" 互信息: {mi:.4f}")
print()
print("互信息与相关系数比较:\n")
analyze_relationship(x1, y1, "线性关系")
analyze_relationship(x2, y2, "非线性关系(二次)")
analyze_relationship(x3, y3, "独立")
KL散度(相对熵)¶
定义:

$$D_{KL}(P \| Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$$

性质: - \(D_{KL}(P \| Q) \geq 0\)(Gibbs不等式) - \(D_{KL}(P \| Q) = 0\) 当且仅当 \(P = Q\) - 不对称:\(D_{KL}(P \| Q) \neq D_{KL}(Q \| P)\)

变分表示(Donsker-Varadhan):

$$D_{KL}(P \| Q) = \sup_{T} \left( \mathbb{E}_{P}[T(X)] - \log \mathbb{E}_{Q}[e^{T(X)}] \right)$$

应用: - 变分推断(ELBO中的KL项) - 策略优化(TRPO/PPO中的KL约束) - 生成模型(VAE)
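Donsker-Varadhan变分表示可以用蒙特卡洛直观验证:对两个正态分布,最优判别函数为 \(T^*(x) = \log \frac{p(x)}{q(x)}\),代入后下界恰好取到KL散度(示意性验证,分布参数为自拟):

```python
import numpy as np
from scipy.stats import norm

np.random.seed(0)
# P = N(0, 1), Q = N(1, 2^2)
P, Q = norm(0, 1), norm(1, 2)

# 闭式KL: log(s2/s1) + (s1^2 + (m1-m2)^2) / (2*s2^2) - 0.5
kl_closed = np.log(2 / 1) + (1 + 1) / (2 * 4) - 0.5

# 最优判别函数 T*(x) = log p(x) - log q(x)
T = lambda x: P.logpdf(x) - Q.logpdf(x)

xp = P.rvs(200000)
xq = Q.rvs(200000)
# DV下界: E_P[T] - log E_Q[exp(T)]
dv_bound = T(xp).mean() - np.log(np.mean(np.exp(T(xq))))

print(f"闭式KL:     {kl_closed:.4f}")
print(f"DV下界估计: {dv_bound:.4f}")
```

对任意其他 \(T\),该下界都严格小于KL;MINE等神经互信息估计方法正是用网络参数化 \(T\) 来最大化这个下界。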
import numpy as np
from scipy.stats import norm, entropy
def kl_divergence_gaussian(mu1, sigma1, mu2, sigma2):
"""
两个正态分布之间的KL散度
D_KL(N(mu1, sigma1^2) || N(mu2, sigma2^2))
= log(sigma2/sigma1) + (sigma1^2 + (mu1-mu2)^2)/(2*sigma2^2) - 0.5
"""
return (np.log(sigma2 / sigma1) +
(sigma1**2 + (mu1 - mu2)**2) / (2 * sigma2**2) - 0.5)
def kl_divergence_discrete(p, q):
    """离散分布的KL散度(约定 0*log(0/q)=0;若存在 p>0 而 q=0,散度为无穷)"""
    p = np.array(p, dtype=float)
    q = np.array(q, dtype=float)
    # 只在 p>0 的位置求和
    mask = p > 0
    if np.any(q[mask] == 0):
        return np.inf
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))
# 示例1:正态分布
mu1, sigma1 = 0, 1
mu2, sigma2 = 1, 2
kl_forward = kl_divergence_gaussian(mu1, sigma1, mu2, sigma2)
kl_reverse = kl_divergence_gaussian(mu2, sigma2, mu1, sigma1)
print(f"KL散度 P||Q: {kl_forward:.4f}")
print(f"KL散度 Q||P: {kl_reverse:.4f}")
print(f"不对称性: {abs(kl_forward - kl_reverse):.4f}")
📚 参考文献¶
核心论文¶
线性代数与优化¶
- Matrix Computations - Golub & Van Loan, 2013
  - 矩阵计算的权威教材
- Convex Optimization - Boyd & Vandenberghe, 2004
  - 凸优化领域的经典教材
- The Elements of Statistical Learning - Hastie et al., 2009
  - 统计学习理论奠基之作
概率与统计¶
- Pattern Recognition and Machine Learning - Bishop, 2006
  - 贝叶斯方法的经典教材
- Information Theory, Inference, and Learning Algorithms - MacKay, 2003
  - 信息论与机器学习的结合
- Variational Inference: A Review for Statisticians - Blei et al., 2017
  - 变分推断综述
机器学习算法¶
- Nearest Neighbor Pattern Classification - Cover & Hart, 1967
  - k-NN算法理论基础
- Random Forests - Breiman, 2001
  - 随机森林算法
- Regularization and Variable Selection via the Elastic Net - Zou & Hastie, 2005
  - Elastic Net正则化方法
- Greedy Function Approximation: A Gradient Boosting Machine - Friedman, 2001
  - 梯度提升算法
- LightGBM: A Highly Efficient Gradient Boosting Decision Tree - Ke et al., 2017
  - LightGBM算法
- XGBoost: A Scalable Tree Boosting System - Chen & Guestrin, 2016
  - XGBoost算法
深度学习¶
- Deep Learning - Goodfellow et al., 2016
  - 深度学习领域的"圣经"
- ImageNet Classification with Deep Convolutional Neural Networks - Krizhevsky et al., 2012
  - AlexNet,深度学习复兴的开端
- Attention Is All You Need - Vaswani et al., 2017
  - Transformer开山之作
技术博客¶
中文博客¶
- 李沐:动手学深度学习 - 系统的深度学习教程
💡 以下为推荐的B站UP主,请在B站直接搜索其名称获取最新内容。
- 3Blue1Brown - 数学可视化(在B站搜索"3Blue1Brown")
- 同济子豪兄 - 机器学习数学基础(在B站搜索"同济子豪兄")
- Datawhale - 数据科学数学基础(在B站搜索"Datawhale")
英文博客¶
- 3Blue1Brown - 数学可视化
- Distill.pub - 可视化机器学习研究
- BetterExplained - 直观理解数学概念
- StatQuest with Josh Starmer - 机器学习概念直观解释
- Jeremy Howard's Blog - 实用深度学习
开源项目¶
数学与科学计算¶
机器学习¶
- scikit-learn - Python机器学习库
- XGBoost - 梯度提升框架
- LightGBM - 轻量级梯度提升
- CatBoost - Yandex的梯度提升库
- PyTorch - 深度学习框架
可视化与工具¶
- Matplotlib - Python绘图库
- Seaborn - 统计数据可视化
- Plotly - 交互式可视化
- TensorBoard - TensorFlow可视化工具
- Weights & Biases - 实验追踪和可视化
参考书籍¶
中文书籍¶
- 《线性代数》- 同济大学数学系 编,高等教育出版社
  - 国内线性代数经典教材
- 《概率论与数理统计》- 浙江大学 编,高等教育出版社
  - 概率统计经典教材
- 《高等数学》- 同济大学数学系 编,高等教育出版社
  - 微积分经典教材
- 《机器学习》- 周志华 著,清华大学出版社
  - "西瓜书",国内机器学习经典教材
- 《统计学习方法》- 李航 著,清华大学出版社
  - 机器学习基础理论的权威教材
- 《深度学习》- Ian Goodfellow、Yoshua Bengio、Aaron Courville 著,人民邮电出版社
  - 深度学习领域的"圣经"
- 《凸优化》- Stephen Boyd、Lieven Vandenberghe 著,清华大学出版社
  - 凸优化领域的经典教材
- 《信息论基础》- Thomas M. Cover、Joy A. Thomas 著,机械工业出版社
  - 信息论领域的经典教材
英文书籍¶
- "Linear Algebra and Its Applications" - Gilbert Strang
-
Wellesley-Cambridge Press,线性代数经典教材
-
"Introduction to Linear Algebra" - Gilbert Strang
-
Wellesley-Cambridge Press,线性代数入门教材
-
"Probability and Statistics" - Morris H. DeGroot, Mark J. Schervish
-
Addison-Wesley,概率统计经典教材
-
"Convex Optimization" - Stephen Boyd, Lieven Vandenberghe
-
Cambridge University Press,凸优化经典教材
-
"Pattern Recognition and Machine Learning" - Christopher M. Bishop
-
Springer,贝叶斯方法经典教材
-
"The Elements of Statistical Learning" - Trevor Hastie, Robert Tibshirani, Jerome Friedman
-
Springer,统计学习理论奠基之作
-
"Deep Learning" - Ian Goodfellow, Yoshua Bengio, Aaron Courville
-
MIT Press,深度学习领域的权威教材
-
"Mathematics for Machine Learning" - Marc Peter Deisenroth, A. Aldo Faisal, Cheng Soon Ong
- Cambridge University Press,机器学习数学基础
在线课程¶
中文课程¶
💡 以下为推荐的B站UP主和搜索关键词,请在B站直接搜索获取最新内容。
- 吴恩达机器学习课程 - 在B站搜索"吴恩达 机器学习"
- 3Blue1Brown数学可视化 - 在B站搜索"3Blue1Brown 线性代数"
- 同济子豪兄 - 在B站搜索"同济子豪兄 数学基础"
- 李沐 - 在B站搜索"李沐 机器学习"
英文课程¶
- Mathematics for Machine Learning Specialization (Coursera) - 机器学习数学专项
- Linear Algebra for Machine Learning (Coursera) - 线性代数
- Multivariable Calculus for Machine Learning (Coursera) - 多元微积分
- 3Blue1Brown Essence of Linear Algebra - 线性代数可视化
- Khan Academy Mathematics - 数学基础课程
社区资源¶
中文社区¶
英文社区¶
- Mathematics Stack Exchange - 数学问答社区
- Cross Validated (Stats Stack Exchange) - 统计学问答社区
- Machine Learning Subreddit - 机器学习讨论
- Papers with Code - 论文与代码实现
- arXiv.org - 最新机器学习论文
论坛与问答¶
- Stack Overflow Mathematics - 数学技术问答
- Stack Overflow Machine Learning - 机器学习技术问答
- SciPy Stack Exchange - 科学计算问答
- PyTorch Forum - PyTorch官方论坛
邮件列表与Slack¶
- Distill.pub - 可视化研究
- PyTorch Slack - PyTorch社区
- SciPy Conference - SciPy年度会议