
00 - Mathematical Foundations

(Figure: overview map of the mathematics behind machine learning)

The mathematical bedrock of machine learning: from fundamentals to the frontier, and from applications back to theory


🎥 Video Tutorial Links

Chinese-Language Video Tutorials

Bilibili Recommendations

💡 The creators and search keywords below are recommendations; search on Bilibili directly for the latest content.

Recommended creators (searchable by name on Bilibili):
  • 3Blue1Brown - intuitive visualizations of linear algebra, calculus, and probability
  • 李沐 (Mu Li) - machine-learning derivations and paper walkthroughs
  • StatQuest - intuitive explanations of statistical concepts in ML (versions with Chinese subtitles exist)

Recommended search keywords (in Chinese):
  • "线性代数 本质", "矩阵 直觉理解"
  • "概率论 机器学习", "最大似然估计 详解"
  • "梯度下降 优化 直觉", "反向传播 推导"

Chinese MOOC Platforms

💡 These are platform recommendations; search within each platform for the relevant course names.

English-Language Video Tutorials

Recommended YouTube Channels

Coursera Courses

Udemy Courses

edX Courses


💻 Online Practice Platforms

Interactive Math Learning

  • Khan Academy - interactive exercises for linear algebra, calculus, and probability
  • Brilliant - interactive math and science learning, including an ML mathematics track
  • Coursera - search for courses such as "Mathematics for Machine Learning"
  • edX - free MIT courses on linear algebra, probability, and statistics

Math Computation and Visualization Tools

  • Wolfram Alpha - computation, formula checking, symbolic manipulation
  • Desmos - function plotting; useful for building intuition about optimization surfaces
  • GeoGebra - geometry and algebra visualization; useful for understanding linear transformations
  • SymPy Live - an online environment for symbolic math in Python

Table of Contents

  1. Linear Algebra
     1.1 Basic Concepts
     1.2 Matrix Calculus
     1.3 Tensor Basics
     1.4 Advanced SVD Applications
     1.5 Spectral Graph Theory
  2. Probability and Statistics
     2.1 Basic Probability Theory
     2.2 Measure-Theoretic Foundations
     2.3 The Multivariate Gaussian
     2.4 Exponential-Family Distributions
     2.5 Stochastic Processes
     2.6 Statistical Inference
  3. Calculus and Optimization
     3.1 Calculus Basics
     3.2 Convex Optimization Theory
     3.3 Advanced Optimization Algorithms
     3.4 Constrained Optimization
  4. Information Theory
     4.1 Entropy
     4.2 Mutual Information and KL Divergence
     4.3 Information Bottleneck Theory

I. Linear Algebra

1.1 Basic Concepts

Vectors and Matrices

The geometry of vectors:
  • A vector is more than a list of numbers: it is a directed segment with a magnitude and a direction
  • In machine learning, a vector represents a point or a direction in feature space

The geometry of matrices:
  • A matrix represents a linear transformation: rotation, scaling, projection, shear
  • Every matrix can be decomposed into a combination of elementary linear transformations

Python
import numpy as np
import matplotlib.pyplot as plt

# Visualizing a linear transformation
A = np.array([[2, 1], [1, 2]])  # a symmetric matrix
eigenvalues, eigenvectors = np.linalg.eig(A)  # np.linalg hosts the linear-algebra routines

print(f"Matrix A:\n{A}")
print(f"\nEigenvalues: {eigenvalues}")
print(f"Eigenvectors:\n{eigenvectors}")

# Geometric reading: eigenvectors are the directions the transformation leaves unchanged;
# the eigenvalues are the scaling factors along those directions

A Deeper Look at Eigendecomposition

The essence of eigendecomposition:

\[A = Q \Lambda Q^{-1}\]

where:
  • \(Q\): the matrix of eigenvectors, defining a new coordinate system
  • \(\Lambda\): a diagonal matrix of scalings along each eigendirection

Physical meaning:
  • The eigenvectors define the principal axes of the transformation
  • The eigenvalues give the stretch factor along each axis
  • A symmetric matrix has an orthogonal basis of eigenvectors

Applications (a numerical check follows below):
  • PCA: the principal components of the data are the eigenvectors of the covariance matrix
  • PageRank: page importance is the dominant eigenvector of the link matrix
  • Vibration analysis: the natural frequencies and mode shapes of a system
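
As a quick numerical check of these statements, the short sketch below (reusing A, eigenvalues, and eigenvectors from the snippet above) reconstructs A from \(Q \Lambda Q^{-1}\) and confirms that a symmetric matrix yields an orthonormal eigenbasis:

Python
# Minimal check of A = Q Λ Q^{-1}, reusing A, eigenvalues, eigenvectors from above
Q = eigenvectors                        # columns are the eigenvectors of A
Lam = np.diag(eigenvalues)              # diagonal matrix of eigenvalues
A_rebuilt = Q @ Lam @ np.linalg.inv(Q)
print(np.allclose(A, A_rebuilt))        # True: the decomposition reconstructs A
print(np.allclose(Q.T @ Q, np.eye(2)))  # True: symmetric A gives an orthonormal Q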

Python
# The mathematical core of PCA: eigendecomposition
def pca_eig(X, n_components=2):
    """PCA via eigendecomposition of the covariance matrix"""
    # Center the data
    X_centered = X - np.mean(X, axis=0)

    # Covariance matrix
    cov_matrix = np.cov(X_centered.T)

    # Eigendecomposition (the covariance matrix is symmetric, so eigh is
    # more stable and guarantees real-valued results)
    eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

    # Sort by eigenvalue, descending
    idx = eigenvalues.argsort()[::-1]
    eigenvalues = eigenvalues[idx]
    eigenvectors = eigenvectors[:, idx]

    # Project
    components = eigenvectors[:, :n_components]
    X_transformed = X_centered @ components

    return X_transformed, eigenvalues, eigenvectors

A Deeper Look at Singular Value Decomposition (SVD)

The essence of SVD:

\[A = U \Sigma V^T\]

Geometric interpretation:
  1. \(V^T\): rotate the input space onto an orthonormal basis
  2. \(\Sigma\): scale along each coordinate axis
  3. \(U\): rotate in the output space

The link between SVD and eigendecomposition (verified numerically below):
  • \(A^TA = V \Sigma^2 V^T\) (the right singular vectors are eigenvectors of \(A^TA\))
  • \(AA^T = U \Sigma^2 U^T\) (the left singular vectors are eigenvectors of \(AA^T\))

Applications:
  • Low-rank approximation: keep the top k singular values
  • The pseudoinverse \(A^+ = V \Sigma^+ U^T\)
  • Recommender systems: matrix factorization
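
The SVD-eigendecomposition link is easy to verify numerically; a minimal sketch on a small random matrix (the shape is arbitrary):

Python
# Squared singular values of A should equal the eigenvalues of A^T A
A = np.random.randn(6, 4)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
eigvals = np.linalg.eigvalsh(A.T @ A)        # ascending order
print(np.allclose(np.sort(s**2), eigvals))   # True
print(np.allclose(A, U @ np.diag(s) @ Vt))   # True: exact reconstruction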

Python
# Low-rank approximation via SVD
def low_rank_approximation(A, k):
    """Rank-k approximation of A via truncated SVD"""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)

    # Keep the top k singular values
    U_k = U[:, :k]
    s_k = s[:k]
    Vt_k = Vt[:k, :]

    # Reconstruct
    A_approx = U_k @ np.diag(s_k) @ Vt_k

    # Relative approximation error
    error = np.linalg.norm(A - A_approx, 'fro') / np.linalg.norm(A, 'fro')

    return A_approx, error, s

# Example: image compression
from sklearn.datasets import load_sample_image
china = load_sample_image("china.jpg")
gray = np.mean(china, axis=2)

k = 50
A_approx, error, singular_values = low_rank_approximation(gray, k=k)
print(f"Compression ratio: {k * (gray.shape[0] + gray.shape[1]) / (gray.shape[0] * gray.shape[1]):.2%}")
print(f"Approximation error: {error:.4f}")

1.2 Matrix Calculus

Vector Derivative Basics

Basic definitions:
  • Scalar-by-vector: \(\frac{\partial f}{\partial \mathbf{x}} = \left[\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, ..., \frac{\partial f}{\partial x_n}\right]^T\)
  • Vector-by-vector: the Jacobian matrix
  • Scalar-by-matrix

Common identities (denominator layout)

Expression | Result
\(\frac{\partial (\mathbf{a}^T\mathbf{x})}{\partial \mathbf{x}}\) | \(\mathbf{a}\)
\(\frac{\partial (\mathbf{x}^T\mathbf{A}\mathbf{x})}{\partial \mathbf{x}}\) | \((\mathbf{A} + \mathbf{A}^T)\mathbf{x}\)
\(\frac{\partial (\mathbf{A}\mathbf{x})}{\partial \mathbf{x}}\) | \(\mathbf{A}^T\)
\(\frac{\partial (\mathbf{x}^T\mathbf{A})}{\partial \mathbf{x}}\) | \(\mathbf{A}\)
\(\frac{\partial \lVert\mathbf{x}\rVert^2}{\partial \mathbf{x}}\) | \(2\mathbf{x}\)
\(\frac{\partial (\mathbf{x}^T\mathbf{x})}{\partial \mathbf{x}}\) | \(2\mathbf{x}\)
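
A quick finite-difference check of the \(\mathbf{x}^T\mathbf{A}\mathbf{x}\) row of this table; the matrix and test point below are arbitrary:

Python
# Verify d(x^T A x)/dx = (A + A^T) x by central differences
A = np.random.randn(4, 4)   # deliberately non-symmetric
x = np.random.randn(4)
f = lambda v: v @ A @ v
eps = 1e-6
num_grad = np.array([(f(x + eps*e) - f(x - eps*e)) / (2*eps) for e in np.eye(4)])
print(np.allclose(num_grad, (A + A.T) @ x, atol=1e-5))  # True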

The Chain Rule in Matrix Form

Let \(\mathbf{y} = f(\mathbf{g}(\mathbf{x}))\); then:

\[\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \frac{\partial \mathbf{y}}{\partial \mathbf{g}} \cdot \frac{\partial \mathbf{g}}{\partial \mathbf{x}}\]
Python
# Matrix calculus in machine learning: the closed-form solution of linear regression

def linear_regression_closed_form(X, y):
    """
    Closed-form least squares
    Objective: min ||Xw - y||^2
    Setting the gradient 2X^T(Xw - y) to zero
    gives w = (X^T X)^(-1) X^T y
    """
    # solving the normal equations is more stable than forming the inverse explicitly
    return np.linalg.solve(X.T @ X, X.T @ y)

# Checking the gradient derivation numerically
def verify_gradient():
    """Compare analytic and numerical gradients"""
    X = np.random.randn(100, 5)
    w_true = np.random.randn(5)
    y = X @ w_true + 0.1 * np.random.randn(100)

    # Numerical gradient
    def loss(w):
        return np.sum((X @ w - y) ** 2)

    def numerical_gradient(w, eps=1e-5):
        grad = np.zeros_like(w)
        for i in range(len(w)):
            w_plus = w.copy()
            w_plus[i] += eps
            w_minus = w.copy()
            w_minus[i] -= eps
            grad[i] = (loss(w_plus) - loss(w_minus)) / (2 * eps)
        return grad

    w = np.random.randn(5)
    numerical_grad = numerical_gradient(w)
    analytical_grad = 2 * X.T @ (X @ w - y)

    print(f"Numerical gradient: {numerical_grad}")
    print(f"Analytic gradient:  {analytical_grad}")
    print(f"Error: {np.linalg.norm(numerical_grad - analytical_grad)}")

verify_gradient()

Second Derivatives and the Hessian

The Hessian matrix:

\[H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}\]

Properties:
  • Symmetric (when the second partials are continuous)
  • Positive definite → local minimum
  • Negative definite → local maximum
  • Indefinite → saddle point

Applications:
  • Newton's method: uses the Hessian to accelerate convergence (a one-step sketch follows the code below)
  • Curvature analysis: understanding the local geometry of a loss surface

Python
# The Hessian in optimization

def compute_hessian(f, x, eps=1e-5):
    """Numerically estimate the Hessian matrix"""
    n = len(x)
    hessian = np.zeros((n, n))

    for i in range(n):
        for j in range(n):
            x_pp = x.copy()
            x_pp[i] += eps
            x_pp[j] += eps

            x_pm = x.copy()
            x_pm[i] += eps
            x_pm[j] -= eps

            x_mp = x.copy()
            x_mp[i] -= eps
            x_mp[j] += eps

            x_mm = x.copy()
            x_mm[i] -= eps
            x_mm[j] -= eps

            hessian[i, j] = (f(x_pp) - f(x_pm) - f(x_mp) + f(x_mm)) / (4 * eps**2)

    return hessian

# Example: the Hessian of a quadratic
f = lambda x: x[0]**2 + 2*x[1]**2 + 3*x[0]*x[1]
x = np.array([1.0, 1.0])
hessian = compute_hessian(f, x)
print(f"Hessian:\n{hessian}")
print(f"Eigenvalues: {np.linalg.eigvals(hessian)}")

1.3 Tensor Basics

Tensor Definitions and Operations

The order (also called rank) of a tensor:
  • Order 0: a scalar
  • Order 1: a vector
  • Order 2: a matrix
  • Order 3 and above: higher-order tensors

Tensor operations:
  • Outer product: \(\mathbf{a} \otimes \mathbf{b} \otimes \mathbf{c}\)
  • Contraction: summing over selected dimensions
  • Tensor networks: representing tensor computations as graphs

Python
import numpy as np

# Basic tensor operations
A = np.random.randn(3, 4, 5)  # an order-3 tensor
B = np.random.randn(4, 5, 6)  # an order-3 tensor

# Tensor contraction (dimensions 1 and 2 of A against 0 and 1 of B)
C = np.tensordot(A, B, axes=([1, 2], [0, 1]))
print(f"A shape: {A.shape}, B shape: {B.shape}")
print(f"Contracted C shape: {C.shape}")

# Einstein summation convention
def batch_matrix_multiply(A, B):
    """Batched matrix multiply: (batch, m, n) @ (batch, n, p) -> (batch, m, p)"""
    return np.einsum('bmn,bnp->bmp', A, B)

A_batch = np.random.randn(10, 3, 4)
B_batch = np.random.randn(10, 4, 5)
C_batch = batch_matrix_multiply(A_batch, B_batch)
print(f"Batched matmul result shape: {C_batch.shape}")

Tensor Decompositions

CP decomposition (CANDECOMP/PARAFAC):

\[\mathcal{X} \approx \sum_{r=1}^{R} \lambda_r \mathbf{a}_r \circ \mathbf{b}_r \circ \mathbf{c}_r\]

Tucker decomposition:

\[\mathcal{X} \approx \mathcal{G} \times_1 A \times_2 B \times_3 C\]

Applications:
  • Compressing high-dimensional data
  • Recommender systems (tensor completion)
  • Low-rank approximation of neural networks

💡 Dependency: install the tensorly library before running the code below:

Bash
pip install tensorly

Python
from tensorly.decomposition import parafac, tucker
import tensorly as tl

# CP decomposition
X = np.random.randn(10, 20, 30)
factors = parafac(X, rank=5)
print("CP factor shapes:")
for i, factor in enumerate(factors.factors):  # enumerate yields index and element together
    print(f"  factor {i}: {factor.shape}")

# Tucker decomposition (recent tensorly versions take the `rank` keyword)
core, factors = tucker(X, rank=[5, 10, 15])
print(f"\nTucker core tensor shape: {core.shape}")
for i, factor in enumerate(factors):
    print(f"  factor matrix {i}: {factor.shape}")

1.4 Advanced SVD Applications

Matrix Completion

Problem setup:
  • Only some entries of the matrix are observed
  • The goal is to recover the full matrix
  • The matrix is assumed to be low-rank

Algorithms:
  • Singular Value Thresholding (SVT)
  • Alternating Least Squares (ALS)

Python
def matrix_completion_svt(M, mask, max_iter=100, tau=1.0):
    """
    Matrix completion via Singular Value Thresholding

    M: observed matrix (missing entries set to 0)
    mask: observation mask (1 = observed, 0 = missing)
    """
    X = M.copy()

    for iteration in range(max_iter):
        # SVD
        U, s, Vt = np.linalg.svd(X, full_matrices=False)

        # Soft-threshold the singular values
        s_threshold = np.maximum(s - tau, 0)

        # Reconstruct
        X_new = U @ np.diag(s_threshold) @ Vt

        # Check convergence on the observed entries BEFORE re-imposing them
        # (after the projection below, the residual there is zero by construction)
        if iteration % 10 == 0:
            error = np.linalg.norm(mask * (X_new - M), 'fro')
            print(f"iteration {iteration}: error = {error:.6f}")

        # Keep the observed entries fixed
        X = mask * M + (1 - mask) * X_new

    return X

# Example
M_true = np.random.randn(50, 40) @ np.random.randn(40, 60)  # a rank-40 matrix
mask = np.random.rand(50, 60) > 0.7  # ~70% of entries missing
M_observed = M_true * mask

M_completed = matrix_completion_svt(M_observed, mask)
error = np.linalg.norm(M_completed - M_true, 'fro') / np.linalg.norm(M_true, 'fro')
print(f"\nRelative completion error: {error:.4f}")

Robust PCA

Problem: decompose a matrix into a low-rank part plus a sparse part

\[M = L + S\]

where \(L\) is low-rank and \(S\) is sparse (the outliers)

Python
def robust_pca(M, lambda_=0.01, max_iter=100):
    """
    Robust PCA (Principal Component Pursuit) via an inexact augmented Lagrangian
    """
    L = np.zeros_like(M)
    S = np.zeros_like(M)
    Y = np.zeros_like(M)
    mu = 1.25 / np.linalg.norm(M, 2)

    for iteration in range(max_iter):
        # Update L (the low-rank part) by singular-value thresholding
        U, s, Vt = np.linalg.svd(M - S + Y/mu, full_matrices=False)
        s_threshold = np.maximum(s - 1/mu, 0)
        L = U @ np.diag(s_threshold) @ Vt

        # Update S (the sparse part) by entrywise soft thresholding
        temp = M - L + Y/mu
        S = np.sign(temp) * np.maximum(np.abs(temp) - lambda_/mu, 0)

        # Update the dual variable
        Y = Y + mu * (M - L - S)

        # Increase mu
        mu = min(mu * 1.5, 1e6)

        if iteration % 10 == 0:
            error = np.linalg.norm(M - L - S, 'fro') / np.linalg.norm(M, 'fro')
            print(f"iteration {iteration}: relative error = {error:.6f}")

    return L, S

# Example: separating noise from a genuinely low-rank matrix
M_clean = np.random.randn(100, 5) @ np.random.randn(5, 100)  # rank 5
noise = np.random.randn(100, 100) * 0.1
# Add sparse outliers
outliers = np.random.rand(100, 100) < 0.05
noise[outliers] = np.random.randn(np.sum(outliers)) * 5
M_noisy = M_clean + noise

L, S = robust_pca(M_noisy)
print(f"\nRank of the low-rank part: {np.linalg.matrix_rank(L, tol=1e-3)}")
print(f"Fraction of nonzeros in the sparse part: {np.mean(S != 0):.2%}")

1.5 Spectral Graph Theory

The Graph Laplacian

Definitions:
  • Adjacency matrix \(A\): \(A_{ij} = 1\) if nodes \(i\) and \(j\) are connected
  • Degree matrix \(D\): \(D_{ii} = \sum_j A_{ij}\)
  • Laplacian: \(L = D - A\)

Normalized Laplacians:
  • Symmetric normalization: \(L_{sym} = D^{-1/2} L D^{-1/2} = I - D^{-1/2} A D^{-1/2}\)
  • Random-walk normalization: \(L_{rw} = D^{-1} L = I - D^{-1} A\)

Properties:
  • \(L\) is positive semidefinite
  • Its smallest eigenvalue is 0, with the all-ones vector as the corresponding eigenvector
  • The multiplicity of the eigenvalue 0 equals the number of connected components (checked in the sketch below)
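
The last property is easy to confirm directly; a small sketch on a hand-built graph with two components:

Python
# Two disjoint edges -> 2 connected components -> eigenvalue 0 with multiplicity 2
A = np.array([[0, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A               # unnormalized Laplacian
eigvals = np.linalg.eigvalsh(L)
print(np.sum(np.isclose(eigvals, 0)))        # 2, matching the component count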

Python
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

def compute_graph_laplacian(adj_matrix, norm_type='sym'):
    """
    Compute a graph Laplacian

    adj_matrix: adjacency matrix (may be sparse)
    norm_type: 'sym' (symmetric normalization) or 'rw' (random walk)
    """
    # Degree matrix
    degrees = np.array(adj_matrix.sum(axis=1)).flatten()
    D = sp.diags(degrees)

    # Unnormalized Laplacian
    L = D - adj_matrix

    if norm_type == 'sym':
        # Symmetric normalization
        D_inv_sqrt = sp.diags(1.0 / np.sqrt(degrees + 1e-10))
        L_norm = D_inv_sqrt @ L @ D_inv_sqrt
    elif norm_type == 'rw':
        # Random-walk normalization
        D_inv = sp.diags(1.0 / (degrees + 1e-10))
        L_norm = D_inv @ L
    else:
        L_norm = L

    return L_norm

# Example: the Laplacian of a ring graph
n = 20
# The two wrap-around diagonals have length 1, hence np.ones(1)
adj = sp.diags([np.ones(n-1), np.ones(n-1), np.ones(1), np.ones(1)],
               offsets=[-1, 1, 1-n, n-1], shape=(n, n))
L = compute_graph_laplacian(adj, norm_type='sym')

# Smallest eigenvalues
eigenvalues, eigenvectors = eigsh(L, k=5, which='SM')
print(f"5 smallest eigenvalues: {eigenvalues}")

The Graph Fourier Transform

Definition:
  • The classical Fourier transform decomposes a function into sinusoids
  • The graph Fourier transform decomposes a graph signal into Laplacian eigenvectors

Transform pair:
  • Forward: \(\hat{f}(\lambda_l) = \sum_{i=1}^N f(i) u_l(i)\)
  • Inverse: \(f(i) = \sum_{l=0}^{N-1} \hat{f}(\lambda_l) u_l(i)\)

Applications:
  • The theoretical basis of graph convolutional networks (GCNs)
  • Graph signal processing
  • Spectral clustering

Python
def graph_fourier_transform(signal, eigenvectors):
    """
    Graph Fourier transform

    signal: graph signal (N,)
    eigenvectors: eigenvectors of the Laplacian (N, N)
    """
    return eigenvectors.T @ signal

def inverse_graph_fourier_transform(fourier_coeffs, eigenvectors):
    """
    Inverse graph Fourier transform
    """
    return eigenvectors @ fourier_coeffs

# Example: smoothing a signal on a graph
n = 100
# A path graph
adj = sp.diags([np.ones(n-1), np.ones(n-1)], offsets=[-1, 1])
L = compute_graph_laplacian(adj)

# Full eigendecomposition (dense is fine at this size, and it yields a complete
# orthonormal basis, which a truncated eigsh call would not)
eigenvalues, eigenvectors = np.linalg.eigh(L.toarray())

# A noisy signal
signal_clean = np.sin(2 * np.pi * np.arange(n) / n * 3)
signal_noisy = signal_clean + 0.5 * np.random.randn(n)

# Graph Fourier transform
fourier_coeffs = graph_fourier_transform(signal_noisy, eigenvectors)

# Low-pass filter: keep only the low-frequency components
k = 10  # number of frequencies to keep
fourier_filtered = fourier_coeffs.copy()
fourier_filtered[k:] = 0

# Inverse transform
signal_filtered = inverse_graph_fourier_transform(fourier_filtered, eigenvectors)

print(f"MSE before denoising: {np.mean((signal_noisy - signal_clean)**2):.4f}")
print(f"MSE after denoising:  {np.mean((signal_filtered - signal_clean)**2):.4f}")

II. Probability and Statistics

2.1 Basic Probability Theory

Probability Spaces and Random Variables

The three ingredients of a probability space \((\Omega, \mathcal{F}, P)\):
  • \(\Omega\): the sample space (the set of all possible outcomes)
  • \(\mathcal{F}\): the event space (a collection of subsets of \(\Omega\))
  • \(P\): the probability measure

Types of random variables:
  • Discrete: probability mass function (PMF) \(P(X = x)\)
  • Continuous: probability density function (PDF) \(f(x)\), with \(P(a < X < b) = \int_a^b f(x) dx\)

Expectation and variance:
  • Expectation: \(E[X] = \int x f(x) dx\) (continuous) or \(\sum x P(X=x)\) (discrete)
  • Variance: \(Var(X) = E[(X - E[X])^2] = E[X^2] - (E[X])^2\)

Python
from scipy import stats
import numpy as np

# Common distributions
# 1. Normal
normal_dist = stats.norm(loc=0, scale=1)
print(f"Normal mean: {normal_dist.mean()}")
print(f"Normal variance: {normal_dist.var()}")

# 2. Binomial
binomial_dist = stats.binom(n=10, p=0.3)
print(f"\nBinomial mean: {binomial_dist.mean()}")
print(f"Binomial variance: {binomial_dist.var()}")

# 3. Poisson
poisson_dist = stats.poisson(mu=3)
print(f"\nPoisson mean: {poisson_dist.mean()}")
print(f"Poisson variance: {poisson_dist.var()}")

# Checking the law of large numbers
sample_sizes = [10, 100, 1000, 10000, 100000]
for n in sample_sizes:
    samples = np.random.randn(n)
    sample_mean = np.mean(samples)
    print(f"sample size {n:6d}: sample mean = {sample_mean:.6f}")

Conditional Probability and Bayes' Theorem

Conditional probability: \(P(A|B) = \frac{P(A \cap B)}{P(B)}\), defined when \(P(B) > 0\)

Bayes' theorem:

\[P(H|E) = \frac{P(E|H) P(H)}{P(E)}\]

where:
  • \(P(H)\): the prior
  • \(P(E|H)\): the likelihood
  • \(P(H|E)\): the posterior
  • \(P(E)\): the evidence (marginal likelihood)

Application: the naive Bayes classifier

Python
# Bayes' theorem in action: spam classification
class NaiveBayesClassifier:
    """
    Multinomial naive Bayes (for text classification)
    """
    def __init__(self, alpha=1.0):  # __init__ is the constructor, called on object creation
        self.alpha = alpha  # Laplace smoothing parameter
        self.class_priors = {}
        self.word_probs = {}
        self.vocab = set()

    def fit(self, X, y):
        """
        X: list of documents, each a list of words
        y: class labels
        """
        n_samples = len(X)
        classes = np.unique(y)

        # Build the vocabulary
        for doc in X:
            self.vocab.update(doc)
        self.vocab = list(self.vocab)
        word_to_idx = {word: i for i, word in enumerate(self.vocab)}

        for c in classes:
            # Prior P(y=c)
            class_docs = [X[i] for i in range(n_samples) if y[i] == c]
            self.class_priors[c] = len(class_docs) / n_samples

            # Word counts
            word_counts = np.zeros(len(self.vocab))
            for doc in class_docs:
                for word in doc:
                    word_counts[word_to_idx[word]] += 1

            # Conditional probabilities P(word|y=c), with smoothing
            total_words = word_counts.sum()
            self.word_probs[c] = (word_counts + self.alpha) / (total_words + self.alpha * len(self.vocab))

        self.word_to_idx = word_to_idx
        return self

    def predict(self, X):
        predictions = []
        for doc in X:
            scores = {}
            for c in self.class_priors:
                # log P(y=c) + sum(log P(word|y=c))
                score = np.log(self.class_priors[c])
                for word in doc:
                    if word in self.word_to_idx:
                        score += np.log(self.word_probs[c][self.word_to_idx[word]])
                scores[c] = score
            predictions.append(max(scores, key=scores.get))
        return predictions

# Example
docs = [
    ['free', 'win', 'money', 'now'],
    ['meeting', 'schedule', 'tomorrow'],
    ['win', 'prize', 'free'],
    ['project', 'deadline', 'meeting']
]
labels = ['spam', 'ham', 'spam', 'ham']

clf = NaiveBayesClassifier()
clf.fit(docs, labels)
test_doc = [['free', 'win']]
prediction = clf.predict(test_doc)
print(f"Test document classified as: {prediction[0]}")

2.2 Measure-Theoretic Foundations (a conceptual introduction)

Why learn measure theory?

What measure theory settles:
  1. Probability on uncountable sets: a rigorous definition of continuous random variables
  2. Probabilities of complicated events: rigorous limits, integrals, and expectations
  3. Rigorous proofs of the law of large numbers and the central limit theorem

Core concepts:
  • σ-algebra: the collection of measurable events
  • Measure: the "size" of a set (generalizing length, area, and volume)
  • Measurable functions: the functions that can be integrated
  • The Lebesgue integral: a more general integral than Riemann's

Connections to machine learning:
  • A rigorous definition of generalization error
  • The theoretical footing of empirical risk minimization
  • A rigorous treatment of stochastic processes

Text Only
Intuition:
- Measure theory is the "infrastructure" of probability theory,
  much as the theory of the real numbers underpins calculus
- For applied machine learning, understanding the concepts is enough;
  the deep proofs are rarely needed

2.3 The Multivariate Gaussian

Definition and Properties

Probability density:

\[f(\mathbf{x}) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)\]

Geometric reading:
  • The mean \(\boldsymbol{\mu}\): the center of the distribution
  • The covariance \(\Sigma\): the shape and orientation
  • Level sets: ellipses (points at constant Mahalanobis distance)

Key properties:
  1. Marginal distributions are Gaussian
  2. Conditional distributions are Gaussian
  3. Linear transformations of Gaussians are Gaussian (checked numerically below)
  4. Independence ↔ uncorrelatedness (a property particular to Gaussians)
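
Property 3 can be checked by Monte Carlo; the sketch below draws samples, applies an arbitrary linear map \(B\), and compares the empirical moments with \(B\boldsymbol{\mu}\) and \(B \Sigma B^T\):

Python
# If x ~ N(mu, Sigma), then Bx ~ N(B mu, B Sigma B^T)
rng = np.random.default_rng(0)
mu_v = np.array([1.0, -1.0])
Sigma_v = np.array([[2.0, 0.6], [0.6, 1.0]])
B = np.array([[1.0, 2.0], [0.0, 1.0]])
X = rng.multivariate_normal(mu_v, Sigma_v, size=200_000)
Y = X @ B.T
print(np.allclose(Y.mean(axis=0), B @ mu_v, atol=0.02))        # True up to sampling noise
print(np.allclose(np.cov(Y.T), B @ Sigma_v @ B.T, atol=0.05))  # True up to sampling noise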

Python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal

# Visualizing the 2-D Gaussian
def plot_gaussian_2d(mean, cov, title):
    x, y = np.mgrid[-3:3:.01, -3:3:.01]
    pos = np.dstack((x, y))
    rv = multivariate_normal(mean, cov)

    plt.figure(figsize=(8, 6))
    plt.contourf(x, y, rv.pdf(pos), levels=20, cmap='viridis')
    plt.colorbar(label='Probability Density')
    plt.title(title)
    plt.xlabel('x1')
    plt.ylabel('x2')

    # Draw the eigenvectors of the covariance
    eigenvalues, eigenvectors = np.linalg.eig(cov)
    for i in range(2):
        vec = eigenvectors[:, i] * np.sqrt(eigenvalues[i])
        plt.arrow(mean[0], mean[1], vec[0], vec[1],
                 head_width=0.1, head_length=0.1, fc='red', ec='red')

    plt.show()

# Shapes induced by different covariance matrices
plot_gaussian_2d([0, 0], [[1, 0], [0, 1]], 'Isotropic (spherical)')
plot_gaussian_2d([0, 0], [[2, 0], [0, 0.5]], 'Axis-aligned ellipse')
plot_gaussian_2d([0, 0], [[1, 0.8], [0.8, 1]], 'Rotated ellipse (positive correlation)')
plot_gaussian_2d([0, 0], [[1, -0.8], [-0.8, 1]], 'Rotated ellipse (negative correlation)')

Conditional and Marginal Distributions

The conditional distribution:

Partition \(\mathbf{x} = \begin{bmatrix} \mathbf{x}_1 \\ \mathbf{x}_2 \end{bmatrix}\); then:

\[\mathbf{x}_1 | \mathbf{x}_2 = \mathbf{a} \sim \mathcal{N}(\boldsymbol{\mu}_{1|2}, \Sigma_{1|2})\]

where:
  • \(\boldsymbol{\mu}_{1|2} = \boldsymbol{\mu}_1 + \Sigma_{12} \Sigma_{22}^{-1} (\mathbf{a} - \boldsymbol{\mu}_2)\)
  • \(\Sigma_{1|2} = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}\)

Applications: Gaussian process regression, Kalman filtering

Python
def conditional_gaussian(mu, Sigma, x2_val, idx1, idx2):
    """
    Conditional distribution of a multivariate Gaussian

    mu: mean vector
    Sigma: covariance matrix
    x2_val: observed value of x2
    idx1: indices of the variables to infer
    idx2: indices of the observed variables
    """
    mu1 = mu[idx1]
    mu2 = mu[idx2]
    Sigma11 = Sigma[np.ix_(idx1, idx1)]
    Sigma12 = Sigma[np.ix_(idx1, idx2)]
    Sigma21 = Sigma[np.ix_(idx2, idx1)]
    Sigma22 = Sigma[np.ix_(idx2, idx2)]

    # Conditional mean
    mu_cond = mu1 + Sigma12 @ np.linalg.inv(Sigma22) @ (x2_val - mu2)

    # Conditional covariance
    Sigma_cond = Sigma11 - Sigma12 @ np.linalg.inv(Sigma22) @ Sigma21

    return mu_cond, Sigma_cond

# Example
mu = np.array([0, 0])
Sigma = np.array([[1, 0.8], [0.8, 1]])

# Given x2 = 1.5, find the conditional distribution of x1
x2_val = np.array([1.5])
mu_cond, Sigma_cond = conditional_gaussian(mu, Sigma, x2_val, [0], [1])

print(f"Conditional: x1 | x2=1.5 ~ N({mu_cond[0]:.3f}, {Sigma_cond[0,0]:.3f})")
print(f"Prior:       x1 ~ N({mu[0]:.3f}, {Sigma[0,0]:.3f})")
print(f"Observing x2 reduces the uncertainty about x1 from {Sigma[0,0]:.3f} to {Sigma_cond[0,0]:.3f}")

2.4 Exponential-Family Distributions

A Unifying Framework

Standard form:

\[f(x|\boldsymbol{\theta}) = h(x) \exp\left(\boldsymbol{\eta}(\boldsymbol{\theta})^T \mathbf{T}(x) - A(\boldsymbol{\theta})\right)\]

where:
  • \(\boldsymbol{\eta}(\boldsymbol{\theta})\): the natural parameters
  • \(\mathbf{T}(x)\): the sufficient statistics
  • \(A(\boldsymbol{\theta})\): the log-partition function (it enforces normalization)
  • \(h(x)\): the base measure

Common distributions in exponential-family form

Distribution | Natural parameter | Sufficient statistic | Log-partition function
Normal | \([\mu/\sigma^2, -1/(2\sigma^2)]\) | \([x, x^2]\) | \(\mu^2/(2\sigma^2) + \log\sigma\)
Bernoulli | \(\log(p/(1-p))\) | \(x\) | \(\log(1+e^\eta)\)
Poisson | \(\log\lambda\) | \(x\) | \(e^\eta\)
Exponential | \(-\lambda\) | \(x\) | \(-\log(-\eta)\)

Key identities (checked numerically after the code below):
  1. \(E[T(X)] = \nabla_\eta A(\eta)\)
  2. \(Var(T(X)) = \nabla^2_\eta A(\eta)\)

Applications:
  • Generalized linear models (GLMs)
  • Variational inference
  • Natural-gradient descent

Python
# A unified view of exponential-family distributions
import numpy as np

class ExponentialFamily:
    """
    Base class for exponential-family distributions
    """
    def __init__(self, natural_params):
        self.eta = natural_params

    def log_partition(self):
        """Log-partition function A(eta)"""
        raise NotImplementedError

    def sufficient_stats(self, x):
        """Sufficient statistics T(x)"""
        raise NotImplementedError

    def log_pdf(self, x):
        """Log-density (omits the base-measure term log h(x))"""
        return np.dot(self.eta, self.sufficient_stats(x)) - self.log_partition()  # np.dot is the dot product

    def mean(self):
        """E[T(X)] = dA/deta"""
        raise NotImplementedError

    def variance(self):
        """Var(T(X)) = d^2A/deta^2"""
        raise NotImplementedError

class GaussianEF(ExponentialFamily):
    """The normal distribution in exponential-family form"""
    def __init__(self, mu, sigma2):
        # Natural parameters: [mu/sigma^2, -1/(2*sigma^2)]
        eta = np.array([mu / sigma2, -0.5 / sigma2])
        super().__init__(eta)  # super() invokes the parent-class constructor
        self.mu = mu
        self.sigma2 = sigma2

    def log_partition(self):
        eta1, eta2 = self.eta
        return -eta1**2 / (4*eta2) - 0.5 * np.log(-2*eta2)

    def sufficient_stats(self, x):
        return np.array([x, x**2])

    def mean(self):
        return np.array([self.mu, self.sigma2 + self.mu**2])

    def variance(self):
        return np.array([[self.sigma2, 2*self.mu*self.sigma2],
                        [2*self.mu*self.sigma2, 2*self.sigma2**2 + 4*self.mu**2*self.sigma2]])

# Check
gaussian = GaussianEF(mu=2, sigma2=4)
print(f"Natural parameters: {gaussian.eta}")
print(f"Log-partition: {gaussian.log_partition():.4f}")
print(f"Mean of T(X): {gaussian.mean()}")

2.5 Stochastic Processes

Markov Chains

Definition:
  • A sequence of states \(X_1, X_2, ..., X_n\)
  • The Markov property: \(P(X_{t+1} | X_t, X_{t-1}, ..., X_1) = P(X_{t+1} | X_t)\)

Transition matrix: \(P_{ij} = P(X_{t+1} = j | X_t = i)\)

Stationary distribution: \(\pi = \pi P\). A reversible chain additionally satisfies detailed balance, \(\pi_i P_{ij} = \pi_j P_{ji}\); detailed balance is sufficient, but not necessary, for stationarity.

Applications:
  • The PageRank algorithm
  • MCMC sampling
  • MDPs in reinforcement learning

Python
import numpy as np

class MarkovChain:
    def __init__(self, transition_matrix):
        self.P = transition_matrix
        self.n_states = transition_matrix.shape[0]

    def simulate(self, initial_state, n_steps):
        """Simulate the chain"""
        states = [initial_state]
        current = initial_state

        for _ in range(n_steps - 1):
            current = np.random.choice(self.n_states, p=self.P[current])
            states.append(current)

        return states

    def stationary_distribution(self):
        """Compute the stationary distribution"""
        # Solve: pi P = pi, sum(pi) = 1
        A = np.vstack([self.P.T - np.eye(self.n_states), np.ones(self.n_states)])
        b = np.zeros(self.n_states + 1)
        b[-1] = 1  # [-1] indexes the last element

        pi = np.linalg.lstsq(A, b, rcond=None)[0]
        return pi

    def n_step_transition(self, n):
        """n-step transition matrix"""
        return np.linalg.matrix_power(self.P, n)

# Example: a weather model
# States: 0 = sunny, 1 = cloudy, 2 = rainy
P = np.array([
    [0.7, 0.2, 0.1],  # sunny -> sunny/cloudy/rainy
    [0.3, 0.4, 0.3],  # cloudy -> sunny/cloudy/rainy
    [0.2, 0.3, 0.5]   # rainy -> sunny/cloudy/rainy
])

mc = MarkovChain(P)
pi = mc.stationary_distribution()
print(f"Stationary distribution: sunny={pi[0]:.3f}, cloudy={pi[1]:.3f}, rainy={pi[2]:.3f}")

# Long-run forecast
P_30 = mc.n_step_transition(30)
print(f"\nAfter 30 days (regardless of the starting weather):")
print(f"P(sunny):  {P_30[0, 0]:.3f}")
print(f"P(cloudy): {P_30[0, 1]:.3f}")
print(f"P(rainy):  {P_30[0, 2]:.3f}")

Gaussian Processes

Definition:
  • A collection of random variables, any finite subset of which is jointly Gaussian
  • Fully determined by a mean function \(m(x)\) and a covariance function \(k(x, x')\)

Covariance functions (kernels):
  • RBF kernel: \(k(x, x') = \sigma^2 \exp\left(-\frac{\|x - x'\|^2}{2l^2}\right)\)
  • Linear kernel: \(k(x, x') = x^T x'\)
  • Periodic kernel: \(k(x, x') = \sigma^2 \exp\left(-\frac{2\sin^2(\pi|x - x'|/p)}{l^2}\right)\)

Applications:
  • Bayesian optimization
  • Regression and classification
  • Time-series forecasting

Python
import numpy as np
from scipy.spatial.distance import cdist

class GaussianProcess:
    """
    Gaussian process regression
    """
    def __init__(self, kernel='rbf', noise=1e-5, **kernel_params):
        self.kernel_name = kernel
        self.noise = noise
        self.kernel_params = kernel_params
        self.X_train = None
        self.y_train = None
        self.K = None

    def kernel(self, X1, X2):
        """Evaluate the kernel matrix"""
        if self.kernel_name == 'rbf':
            length_scale = self.kernel_params.get('length_scale', 1.0)
            sigma_f = self.kernel_params.get('sigma_f', 1.0)

            dist = cdist(X1, X2, 'sqeuclidean')
            return sigma_f**2 * np.exp(-0.5 * dist / length_scale**2)

        elif self.kernel_name == 'linear':
            return X1 @ X2.T

        else:
            raise ValueError(f"Unknown kernel: {self.kernel_name}")

    def fit(self, X, y):
        """Fit the GP"""
        self.X_train = X
        self.y_train = y

        # Kernel matrix with noise on the diagonal; Cholesky factorization
        K = self.kernel(X, X)
        self.K = K + self.noise * np.eye(len(X))
        self.L = np.linalg.cholesky(self.K)
        self.alpha = np.linalg.solve(self.L.T, np.linalg.solve(self.L, y))

        return self

    def predict(self, X_test, return_std=True):
        """Predict"""
        # Kernel between training and test points
        K_s = self.kernel(self.X_train, X_test)

        # Predictive mean
        mu = K_s.T @ self.alpha

        if return_std:
            # Predictive variance
            K_ss = self.kernel(X_test, X_test)
            v = np.linalg.solve(self.L, K_s)
            var = np.diag(K_ss) - np.sum(v**2, axis=0)
            std = np.sqrt(np.maximum(var, 0))
            return mu, std

        return mu

# Example: fitting a function with a GP
np.random.seed(42)

# True function
def f(x):
    return np.sin(x) + 0.1 * x**2

# Training data
X_train = np.array([-3, -2, -1, 0, 1, 2, 3]).reshape(-1, 1)
y_train = f(X_train).ravel() + 0.1 * np.random.randn(len(X_train))

# Test points
X_test = np.linspace(-4, 4, 100).reshape(-1, 1)

# Fit the GP
gp = GaussianProcess(kernel='rbf', length_scale=1.0, sigma_f=1.0)
gp.fit(X_train, y_train)
mu, std = gp.predict(X_test)

# Plot
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
plt.plot(X_test, f(X_test), 'r--', label='True function')
plt.plot(X_test, mu, 'b-', label='GP mean')
plt.fill_between(X_test.ravel(), mu - 2*std, mu + 2*std, alpha=0.2, label='95% confidence')
plt.scatter(X_train, y_train, c='red', s=100, zorder=5, label='Training data')
plt.legend()
plt.title('Gaussian Process Regression')
plt.show()

2.6 Statistical Inference

Maximum Likelihood Estimation (MLE)

Principle: choose the parameters that make the observed data most probable

\[\hat{\theta}_{MLE} = \arg\max_\theta P(D|\theta) = \arg\max_\theta \sum_{i=1}^n \log P(x_i|\theta)\]

Properties:
  • Consistency: \(\hat{\theta}_{MLE} \xrightarrow{p} \theta_0\) (as \(n \to \infty\))
  • Asymptotic normality: \(\sqrt{n}(\hat{\theta}_{MLE} - \theta_0) \xrightarrow{d} \mathcal{N}(0, I^{-1}(\theta_0))\)
  • Asymptotic efficiency: it attains the Cramér-Rao bound

Python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

def maximum_likelihood_estimation(data, distribution='normal'):
    """
    Maximum likelihood estimation

    data: observations
    distribution: distribution family
    """
    if distribution == 'normal':
        # The normal MLE has a closed form
        mu_mle = np.mean(data)
        sigma2_mle = np.var(data, ddof=0)  # note: the MLE divides by n, not n-1
        return {'mu': mu_mle, 'sigma2': sigma2_mle}

    elif distribution == 'exponential':
        # Exponential: lambda_MLE = 1/mean
        lambda_mle = 1 / np.mean(data)
        return {'lambda': lambda_mle}

    else:
        raise ValueError(f"Unknown distribution: {distribution}")

# Example: estimating normal parameters
true_mu, true_sigma = 5, 2
data = np.random.normal(true_mu, true_sigma, size=1000)

params = maximum_likelihood_estimation(data, 'normal')
print(f"True parameters: mu={true_mu}, sigma^2={true_sigma**2}")
print(f"MLE: mu={params['mu']:.4f}, sigma^2={params['sigma2']:.4f}")

# Visualizing the likelihood surface
mu_range = np.linspace(3, 7, 100)
sigma_range = np.linspace(1, 4, 100)
Mu, Sigma = np.meshgrid(mu_range, sigma_range)

log_likelihood = np.zeros_like(Mu)
for i in range(len(sigma_range)):
    for j in range(len(mu_range)):
        log_likelihood[i, j] = np.sum(norm.logpdf(data, Mu[i, j], Sigma[i, j]))

plt.figure(figsize=(10, 6))
contour = plt.contour(Mu, Sigma, log_likelihood, levels=20)
plt.clabel(contour, inline=True, fontsize=8)
plt.plot(params['mu'], np.sqrt(params['sigma2']), 'r*', markersize=15, label='MLE')
plt.xlabel('mu')
plt.ylabel('sigma')
plt.title('Log-Likelihood Contours')
plt.legend()
plt.show()

The Fisher Information Matrix and the Cramér-Rao Bound

Fisher information:

\[I(\theta) = E\left[\left(\frac{\partial \log P(X|\theta)}{\partial \theta}\right)\left(\frac{\partial \log P(X|\theta)}{\partial \theta}\right)^T\right] = -E\left[\frac{\partial^2 \log P(X|\theta)}{\partial \theta^2}\right]\]

The Cramér-Rao bound:

The variance of any unbiased estimator satisfies:

\[Var(\hat{\theta}) \geq I^{-1}(\theta)\]

Takeaway: the MLE is asymptotically optimal

Python
def compute_fisher_information_normal(mu, sigma2, n):
    """
    Fisher information for the normal distribution

    For N(mu, sigma^2):
    I(mu) = n / sigma^2
    I(sigma^2) = n / (2 * sigma^4)
    """
    I_mu = n / sigma2
    I_sigma2 = n / (2 * sigma2**2)

    return I_mu, I_sigma2

# Checking the Cramér-Rao bound empirically
n = 100
mu_true, sigma2_true = 5, 4

# Repeat the experiment and measure the variance of the sample mean
n_experiments = 1000
mu_estimates = []

for _ in range(n_experiments):
    data = np.random.normal(mu_true, np.sqrt(sigma2_true), n)
    mu_estimates.append(np.mean(data))

empirical_variance = np.var(mu_estimates, ddof=1)

# Theoretical lower bound
I_mu, _ = compute_fisher_information_normal(mu_true, sigma2_true, n)
cramer_rao_bound = 1 / I_mu

print(f"Empirical variance of the sample mean: {empirical_variance:.6f}")
print(f"Cramér-Rao bound: {cramer_rao_bound:.6f}")
print(f"MLE attains the bound: {np.isclose(empirical_variance, cramer_rao_bound, rtol=0.1)}")

Maximum a Posteriori Estimation (MAP)

Principle: combine prior knowledge with the observed data

\[\hat{\theta}_{MAP} = \arg\max_\theta P(\theta|D) = \arg\max_\theta P(D|\theta)P(\theta)\]

Relationship to MLE:
  • With a uniform prior, MAP = MLE
  • MAP can be read as a regularized MLE

Python
def map_estimate_normal(data, prior_mu=0, prior_sigma2=1):
    """
    MAP estimate of a normal mean

    Prior: mu ~ N(prior_mu, prior_sigma2)
    Likelihood: data ~ N(mu, sigma^2) (sigma^2 assumed known)
    """
    n = len(data)
    sample_mean = np.mean(data)

    # Assume sigma^2 = 1 for simplicity
    sigma2 = 1

    # The MAP estimate is a precision-weighted average of prior and likelihood
    # Posterior: mu|data ~ N(mu_n, sigma_n^2), where:
    # 1/sigma_n^2 = n/sigma^2 + 1/prior_sigma2
    # mu_n = sigma_n^2 * (n*sample_mean/sigma^2 + prior_mu/prior_sigma2)

    posterior_precision = n / sigma2 + 1 / prior_sigma2
    posterior_variance = 1 / posterior_precision

    posterior_mean = posterior_variance * (n * sample_mean / sigma2 + prior_mu / prior_sigma2)

    return posterior_mean, posterior_variance

# Example: MAP shines with small samples
np.random.seed(42)
true_mu = 5
n_samples = 5  # a small sample

data = np.random.normal(true_mu, 1, n_samples)

# MLE
mle = np.mean(data)

# MAP (strong prior: mu believed to be near 0)
map_est, map_var = map_estimate_normal(data, prior_mu=0, prior_sigma2=1)

# MAP (weak prior)
map_est_weak, _ = map_estimate_normal(data, prior_mu=0, prior_sigma2=10)

print(f"True value: {true_mu}")
print(f"MLE: {mle:.4f}")
print(f"MAP (strong prior): {map_est:.4f}")
print(f"MAP (weak prior): {map_est_weak:.4f}")
print(f"\nA strong prior pulls the estimate toward 0, acting as a regularizer")

III. Calculus and Optimization

3.1 Calculus Basics

Gradient, Hessian, and Taylor Expansion

First-order Taylor expansion:

\[f(\mathbf{x} + \mathbf{\Delta x}) \approx f(\mathbf{x}) + \nabla f(\mathbf{x})^T \mathbf{\Delta x}\]

Second-order Taylor expansion:

\[f(\mathbf{x} + \mathbf{\Delta x}) \approx f(\mathbf{x}) + \nabla f(\mathbf{x})^T \mathbf{\Delta x} + \frac{1}{2} \mathbf{\Delta x}^T H(\mathbf{x}) \mathbf{\Delta x}\]

Geometric meaning (illustrated numerically below):
  • The gradient points in the direction of fastest increase
  • The Hessian captures the local curvature
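
A small numerical illustration of the two expansions; the test function and expansion point below are arbitrary, and for a quadratic the second-order expansion is exact:

Python
# Compare first- and second-order Taylor approximations of f(x, y) = x^2 + 2y^2
f = lambda v: v[0]**2 + 2*v[1]**2
grad = lambda v: np.array([2*v[0], 4*v[1]])
H = np.array([[2.0, 0.0], [0.0, 4.0]])   # constant Hessian of this quadratic

x0 = np.array([1.0, 1.0])
dx = np.array([0.1, -0.2])

first = f(x0) + grad(x0) @ dx
second = first + 0.5 * dx @ H @ dx
print(f"exact: {f(x0 + dx):.6f}, 1st order: {first:.6f}, 2nd order: {second:.6f}")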

Python
import numpy as np
import matplotlib.pyplot as plt

# Visualizing the gradient field
def f(x, y):
    return x**2 + 2*y**2

def grad_f(x, y):
    return np.array([2*x, 4*y])

# Plot contours and the gradient field
x = np.linspace(-3, 3, 20)
y = np.linspace(-3, 3, 20)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)

U, V = grad_f(X, Y)

plt.figure(figsize=(10, 8))
contour = plt.contour(X, Y, Z, levels=15)
plt.clabel(contour, inline=True, fontsize=8)
plt.quiver(X, Y, U, V, alpha=0.6)
plt.xlabel('x')
plt.ylabel('y')
plt.title('Function Contours and Gradient Field')
plt.colorbar(contour)
plt.show()

# The gradient points in the direction of steepest ascent, perpendicular to the contours

The Chain Rule and Backpropagation

The multivariate chain rule:

Let \(z = f(y_1, y_2, ..., y_m)\) with \(y_i = g_i(x_1, x_2, ..., x_n)\); then:

\[\frac{\partial z}{\partial x_j} = \sum_{i=1}^m \frac{\partial z}{\partial y_i} \frac{\partial y_i}{\partial x_j}\]

What backpropagation really is:
  • Dynamic programming for computing gradients efficiently
  • Errors propagate from the output layer back toward the inputs

Python
import numpy as np

class ComputationalGraph:
    """
    A minimal computational graph with backpropagation
    """
    def __init__(self):
        self.gradients = {}

    def forward(self, x, y, z):
        """
        Compute: f = (x + y) * z
        """
        self.x = x
        self.y = y
        self.z = z

        self.q = x + y  # intermediate value
        self.f = self.q * z

        return self.f

    def backward(self, df=1):
        """
        Backpropagate the gradients

        f = q * z
        q = x + y

        df/df = 1
        df/dq = z
        df/dz = q
        df/dx = df/dq * dq/dx = z * 1 = z
        df/dy = df/dq * dq/dy = z * 1 = z
        """
        # Start from the output
        df_dq = self.z
        df_dz = self.q

        # Propagate to the inputs
        df_dx = df_dq * 1  # dq/dx = 1
        df_dy = df_dq * 1  # dq/dy = 1

        self.gradients = {
            'x': df_dx,
            'y': df_dy,
            'z': df_dz
        }

        return self.gradients

# Example
graph = ComputationalGraph()
f = graph.forward(x=2, y=3, z=4)
grads = graph.backward()  # backpropagation

print(f"f = (2 + 3) * 4 = {f}")
print(f"df/dx = {grads['x']}")
print(f"df/dy = {grads['y']}")
print(f"df/dz = {grads['z']}")

# Numerical check
eps = 1e-5
f_x_plus = graph.forward(x=2+eps, y=3, z=4)
grad_x_numerical = (f_x_plus - f) / eps
print(f"\nNumerical check of df/dx: {grad_x_numerical}")

3.2 Convex Optimization Theory

Convex Sets and Convex Functions

Convex set: a set \(C\) is convex if for all \(\mathbf{x}, \mathbf{y} \in C\) and \(\theta \in [0, 1]\):

\[\theta \mathbf{x} + (1-\theta) \mathbf{y} \in C\]

Convex function: \(f\) is convex if:

\[f(\theta \mathbf{x} + (1-\theta) \mathbf{y}) \leq \theta f(\mathbf{x}) + (1-\theta) f(\mathbf{y})\]

Equivalent conditions (for differentiable functions):
  • First-order: \(f(\mathbf{y}) \geq f(\mathbf{x}) + \nabla f(\mathbf{x})^T (\mathbf{y} - \mathbf{x})\)
  • Second-order: \(H(\mathbf{x}) \succeq 0\) (positive-semidefinite Hessian)

Why convexity matters:
  • Every local optimum is a global optimum
  • Efficient algorithms exist
  • Many ML problems are convex (or admit convex approximations)

Python
import numpy as np
import matplotlib.pyplot as plt

# Visualizing convex and non-convex functions
x = np.linspace(-3, 3, 100)

# A convex function
f_convex = x**2

# A non-convex function
f_nonconvex = x**4 - 2*x**2 + 0.5*x

plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.plot(x, f_convex, 'b-', linewidth=2)
plt.title('Convex Function: $f(x) = x^2$')
plt.xlabel('x')
plt.ylabel('f(x)')
plt.grid(True)

# Draw a chord to illustrate convexity
x1, x2 = -1.5, 1.5
plt.plot([x1, x2], [x1**2, x2**2], 'r--', label='Chord')
plt.legend()

plt.subplot(1, 2, 2)
plt.plot(x, f_nonconvex, 'b-', linewidth=2)
plt.title('Non-Convex Function')
plt.xlabel('x')
plt.ylabel('f(x)')
plt.grid(True)

plt.tight_layout()
plt.show()

# Checking convexity: is the second derivative nonnegative?
def check_convexity_1d(f_func, x_range):
    """Check convexity of a univariate function via its second derivative"""

    def second_derivative(f, x, eps=1e-6):
        """Central-difference approximation of the second derivative"""
        return (f(x + eps) - 2 * f(x) + f(x - eps)) / (eps ** 2)

    second_derivatives = [second_derivative(f_func, x) for x in x_range]

    if all(d >= 0 for d in second_derivatives):  # all() is True only if every element is True
        return True, min(second_derivatives)
    else:
        return False, min(second_derivatives)

is_convex, min_second_deriv = check_convexity_1d(lambda x: x**2, np.linspace(-2, 2, 10))
print(f"f(x)=x^2 is convex: {is_convex}, smallest second derivative: {min_second_deriv:.4f}")

Duality

The Lagrangian:

For the optimization problem:

\[\min_{\mathbf{x}} f(\mathbf{x}) \quad \text{s.t.} \quad g_i(\mathbf{x}) \leq 0, \quad h_j(\mathbf{x}) = 0\]

the Lagrangian is:

\[L(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\nu}) = f(\mathbf{x}) + \sum_i \lambda_i g_i(\mathbf{x}) + \sum_j \nu_j h_j(\mathbf{x})\]

The dual function:

\[g(\boldsymbol{\lambda}, \boldsymbol{\nu}) = \inf_{\mathbf{x}} L(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\nu})\]

The dual problem:

\[\max_{\boldsymbol{\lambda} \geq 0, \boldsymbol{\nu}} g(\boldsymbol{\lambda}, \boldsymbol{\nu})\]

Weak duality: \(d^* \leq p^*\) (the dual optimum never exceeds the primal optimum)
Strong duality: \(d^* = p^*\) (holds for convex problems under a constraint qualification, e.g. Slater's condition)

Python
from scipy.optimize import minimize
import numpy as np

def solve_primal_dual():
    """
    Solve a primal problem and its dual

    Primal:
    min x^2 + y^2
    s.t. x + y >= 1
    """
    # Primal problem
    def objective(x):
        return x[0]**2 + x[1]**2

    def constraint(x):
        return x[0] + x[1] - 1  # >= 0

    # Solve the primal
    result_primal = minimize(
        objective,
        x0=[0, 0],
        method='SLSQP',
        constraints={'type': 'ineq', 'fun': constraint}
    )

    p_star = result_primal.fun
    x_star = result_primal.x

    # Dual: max_lambda min_{x,y} x^2 + y^2 + lambda*(1 - x - y)
    # For fixed lambda, the inner minimum is at x = y = lambda/2,
    # which gives g(lambda) = lambda - lambda^2/2
    # Dual problem: max_lambda lambda - lambda^2/2, s.t. lambda >= 0

    def dual_objective(lambda_):
        return -(lambda_ - lambda_**2 / 2)  # negated because minimize() minimizes

    result_dual = minimize(
        dual_objective,
        x0=[0.5],
        method='SLSQP',
        bounds=[(0, None)]
    )

    d_star = -result_dual.fun
    lambda_star = result_dual.x[0]

    print("Primal problem:")
    print(f"  optimal value p* = {p_star:.6f}")
    print(f"  optimizer x* = [{x_star[0]:.6f}, {x_star[1]:.6f}]")

    print("\nDual problem:")
    print(f"  optimal value d* = {d_star:.6f}")
    print(f"  optimizer lambda* = {lambda_star:.6f}")

    print(f"\nDuality gap: p* - d* = {p_star - d_star:.10f}")
    print(f"Strong duality holds: {np.isclose(p_star, d_star)}")

solve_primal_dual()

KKT Conditions

The Karush-Kuhn-Tucker (KKT) conditions:

For a convex problem with strong duality, \(\mathbf{x}^*\) is optimal if and only if there exist \(\boldsymbol{\lambda}^*, \boldsymbol{\nu}^*\) satisfying:

  1. Stationarity: \(\nabla f(\mathbf{x}^*) + \sum_i \lambda_i^* \nabla g_i(\mathbf{x}^*) + \sum_j \nu_j^* \nabla h_j(\mathbf{x}^*) = 0\)
  2. Primal feasibility: \(g_i(\mathbf{x}^*) \leq 0\), \(h_j(\mathbf{x}^*) = 0\)
  3. Dual feasibility: \(\lambda_i^* \geq 0\)
  4. Complementary slackness: \(\lambda_i^* g_i(\mathbf{x}^*) = 0\)

Significance: the KKT conditions turn a constrained optimization problem into a system of equations

Python
import numpy as np
from scipy.optimize import minimize

def verify_kkt():
    """
    Verify the KKT conditions

    Problem: min (x-2)^2 + (y-1)^2
             s.t. x + y <= 2
                  x >= 0, y >= 0
    """
    # Solve
    def objective(x):
        return (x[0] - 2)**2 + (x[1] - 1)**2

    def grad_objective(x):
        return np.array([2*(x[0]-2), 2*(x[1]-1)])

    constraints = [
        {'type': 'ineq', 'fun': lambda x: 2 - x[0] - x[1]},  # x + y <= 2
        {'type': 'ineq', 'fun': lambda x: x[0]},             # x >= 0
        {'type': 'ineq', 'fun': lambda x: x[1]}              # y >= 0
    ]

    result = minimize(objective, x0=[0, 0], method='SLSQP', constraints=constraints)

    x_star = result.x
    print(f"Optimizer: x* = [{x_star[0]:.6f}, {x_star[1]:.6f}]")
    print(f"Constraint x + y <= 2: {x_star[0] + x_star[1]:.6f}")

    # At the solution the first constraint is active,
    # so by the KKT conditions there exists lambda >= 0 with:
    # grad_f(x*) + lambda * grad_g(x*) = 0
    # where g(x) = x + y - 2

    # grad_f = [2(x-2), 2(y-1)]
    # grad_g = [1, 1]

    # 2(x-2) + lambda = 0  =>  lambda = 2(2-x)
    # 2(y-1) + lambda = 0  =>  lambda = 2(1-y)

    lambda_kkt = 2 * (2 - x_star[0])
    print(f"\nKKT multiplier: lambda = {lambda_kkt:.6f}")

    # Check stationarity
    grad_f = grad_objective(x_star)
    grad_g = np.array([1, 1])
    stability = grad_f + lambda_kkt * grad_g

    print(f"\nKKT stationarity:")
    print(f"  grad_f(x*) = [{grad_f[0]:.6f}, {grad_f[1]:.6f}]")
    print(f"  grad_f(x*) + lambda * grad_g(x*) = [{stability[0]:.10f}, {stability[1]:.10f}]")
    print(f"  condition satisfied: {np.allclose(stability, 0)}")

    # Check complementary slackness
    complementarity = lambda_kkt * (x_star[0] + x_star[1] - 2)
    print(f"\nComplementary slackness: lambda * g(x*) = {complementarity:.10f}")

verify_kkt()

3.3 Advanced Optimization Algorithms

Newton and Quasi-Newton Methods

Newton's method:

Uses second-order (Hessian) information to accelerate convergence:

\[\mathbf{x}_{t+1} = \mathbf{x}_t - H^{-1}(\mathbf{x}_t) \nabla f(\mathbf{x}_t)\]

Pros:
  • Quadratic convergence (near the solution)
  • Fast in iteration count

Cons:
  • Requires forming and storing the Hessian
  • Requires solving a linear system at each step

Quasi-Newton methods (BFGS):

Replace the Hessian with an approximation \(B_t\) satisfying the secant condition:

\[B_{t+1} \mathbf{s}_t = \mathbf{y}_t\]

where \(\mathbf{s}_t = \mathbf{x}_{t+1} - \mathbf{x}_t\) and \(\mathbf{y}_t = \nabla f_{t+1} - \nabla f_t\) (verified numerically below)
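
A minimal sketch of one BFGS update, verifying that the secant condition \(B_{t+1}\mathbf{s}_t = \mathbf{y}_t\) holds after the rank-two correction (the vectors and initial \(B\) below are arbitrary, chosen so that \(\mathbf{s}^T\mathbf{y} > 0\)):

Python
# One BFGS update of the Hessian approximation B, then check B_new @ s == y
B = np.eye(3)                      # initial Hessian approximation
s = np.array([0.5, -0.2, 0.1])     # step: x_{t+1} - x_t
y = np.array([1.0, 0.3, -0.4])     # gradient difference, with s @ y > 0

Bs = B @ s
B_new = B + np.outer(y, y) / (y @ s) - np.outer(Bs, Bs) / (s @ Bs)
print(np.allclose(B_new @ s, y))   # True: the secant condition holds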

Python
import numpy as np
from scipy.optimize import minimize

def rosenbrock(x):
    """The Rosenbrock function (a standard optimization test problem)"""
    return sum(100.0*(x[1:]-x[:-1]**2.0)**2.0 + (1-x[:-1])**2.0)  # slicing selects subsequences

def rosenbrock_grad(x):
    """Gradient of the Rosenbrock function"""
    xm = x[1:-1]
    xm_m1 = x[:-2]
    xm_p1 = x[2:]
    der = np.zeros_like(x)
    der[1:-1] = 200*(xm-xm_m1**2) - 400*(xm_p1 - xm**2)*xm - 2*(1-xm)
    der[0] = -400*x[0]*(x[1]-x[0]**2) - 2*(1-x[0])
    der[-1] = 200*(x[-1]-x[-2]**2)
    return der

# Comparing optimizers
x0 = np.array([-1.0, 1.0])

methods = ['Nelder-Mead', 'BFGS', 'Newton-CG']
results = {}

for method in methods:
    if method == 'Newton-CG':
        # Newton-CG needs the Hessian or Hessian-vector products
        result = minimize(rosenbrock, x0, method=method,
                         jac=rosenbrock_grad,  # Newton-CG falls back to finite-difference Hessian products
                         options={'maxiter': 100})
    else:
        result = minimize(rosenbrock, x0, method=method,
                         jac=rosenbrock_grad if method != 'Nelder-Mead' else None,
                         options={'maxiter': 100})

    results[method] = {
        'success': result.success,
        'iterations': result.nit,
        'function_evals': result.nfev,
        'final_value': result.fun,
        'solution': result.x
    }

print("Optimizer comparison:\n")
print(f"{'method':<15} {'success':<8} {'iters':<8} {'f-evals':<12} {'final value':<15}")
print("-" * 60)
for method, res in results.items():
    print(f"{method:<15} {str(res['success']):<8} {res['iterations']:<8} "
          f"{res['function_evals']:<12} {res['final_value']:<15.6e}")

Stochastic Optimization Theory

Convergence of SGD:

For convex objectives, with step size \(\eta_t = \frac{\eta_0}{\sqrt{t}}\):

\[E[f(\bar{\mathbf{x}}_T)] - f^* \leq O\left(\frac{1}{\sqrt{T}}\right)\]

For strongly convex objectives, with step size \(\eta_t = \frac{1}{\mu t}\):

\[E[f(\mathbf{x}_T)] - f^* \leq O\left(\frac{1}{T}\right)\]

Variance-reduction methods:
  • SVRG (Stochastic Variance Reduced Gradient)
  • SAGA
  • Convergence rate \(O(\frac{1}{T})\), matching full-gradient methods

Python
import numpy as np

class StochasticOptimizer:
    """
    A unified framework for stochastic optimization
    """
    def __init__(self, lr_schedule='constant'):
        self.lr_schedule = lr_schedule

    def learning_rate(self, t, eta0=0.01):
        """Learning-rate schedule"""
        if self.lr_schedule == 'constant':
            return eta0
        elif self.lr_schedule == 'decay':
            return eta0 / (1 + 0.1 * t)
        elif self.lr_schedule == 'sqrt_decay':
            return eta0 / np.sqrt(1 + t)
        elif self.lr_schedule == 'inv_t':
            return eta0 / (1 + t)

    def sgd(self, grad_func, x0, n_iterations=1000, batch_size=1, eta0=0.01):
        """Stochastic gradient descent"""
        x = x0.copy()
        trajectory = [x.copy()]

        for t in range(n_iterations):
            eta = self.learning_rate(t, eta0)

            # Sample a stochastic gradient (simplified here)
            grad = grad_func(x, batch_size)
            x = x - eta * grad

            trajectory.append(x.copy())

        return x, trajectory

    def svrg(self, sample_grad, full_grad_func, x0, n_samples,
             n_epochs=10, n_inner=100, eta=0.01):
        """SVRG: variance-reduced SGD.

        sample_grad(w, i) returns the gradient on example i; using the SAME
        example at both the current point and the snapshot is what cancels
        the variance.
        """
        x = x0.copy()

        for epoch in range(n_epochs):
            # Full gradient at a snapshot
            mu = full_grad_func(x)
            x_snapshot = x.copy()

            for t in range(n_inner):
                i = np.random.randint(n_samples)

                # Variance-reduced gradient: same sample i at both points
                v = sample_grad(x, i) - sample_grad(x_snapshot, i) + mu

                x = x - eta * v

        return x

# Example: SGD for linear regression
np.random.seed(42)
n_samples, n_features = 1000, 10
X = np.random.randn(n_samples, n_features)
true_w = np.random.randn(n_features)
y = X @ true_w + 0.1 * np.random.randn(n_samples)

def grad_func(w, batch_size):
    """Stochastic gradient"""
    indices = np.random.choice(n_samples, batch_size, replace=False)
    X_batch = X[indices]
    y_batch = y[indices]
    return 2 * X_batch.T @ (X_batch @ w - y_batch) / batch_size

def full_grad_func(w):
    """Full gradient"""
    return 2 * X.T @ (X @ w - y) / n_samples

optimizer = StochasticOptimizer(lr_schedule='sqrt_decay')
w_sgd, _ = optimizer.sgd(grad_func, np.zeros(n_features),
                         n_iterations=1000, batch_size=10, eta0=0.01)

print(f"True parameters: {true_w[:3]}")
print(f"SGD estimate:    {w_sgd[:3]}")
print(f"Error: {np.linalg.norm(w_sgd - true_w):.6f}")

Adaptive Learning-Rate Methods

AdaGrad:

\[\mathbf{g}_t = \nabla f(\mathbf{x}_t)\]
\[\mathbf{G}_t = \mathbf{G}_{t-1} + \mathbf{g}_t \odot \mathbf{g}_t\]
\[\mathbf{x}_{t+1} = \mathbf{x}_t - \frac{\eta}{\sqrt{\mathbf{G}_t + \epsilon}} \odot \mathbf{g}_t\]

Key trait: effectively larger learning rates for sparse gradients

RMSprop:

\[\mathbf{E}[\mathbf{g}^2]_t = \gamma \mathbf{E}[\mathbf{g}^2]_{t-1} + (1-\gamma) \mathbf{g}_t^2\]
\[\mathbf{x}_{t+1} = \mathbf{x}_t - \frac{\eta}{\sqrt{\mathbf{E}[\mathbf{g}^2]_t + \epsilon}} \odot \mathbf{g}_t\]

Adam (Adaptive Moment Estimation):

\[\mathbf{m}_t = \beta_1 \mathbf{m}_{t-1} + (1-\beta_1) \mathbf{g}_t \quad \text{(first-moment estimate)}\]
\[\mathbf{v}_t = \beta_2 \mathbf{v}_{t-1} + (1-\beta_2) \mathbf{g}_t^2 \quad \text{(second-moment estimate)}\]
\[\hat{\mathbf{m}}_t = \frac{\mathbf{m}_t}{1-\beta_1^t}, \quad \hat{\mathbf{v}}_t = \frac{\mathbf{v}_t}{1-\beta_2^t}\]
\[\mathbf{x}_{t+1} = \mathbf{x}_t - \frac{\eta}{\sqrt{\hat{\mathbf{v}}_t} + \epsilon} \hat{\mathbf{m}}_t\]
Python
import numpy as np

class AdaptiveOptimizer:
    """
    Base class for adaptive learning-rate optimizers
    """
    def __init__(self, params):
        self.params = params
        self.t = 0

    def step(self, grads):
        raise NotImplementedError

class Adam(AdaptiveOptimizer):
    """Adam optimizer"""
    def __init__(self, params, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        super().__init__(params)
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.eps = eps

        self.m = np.zeros_like(params)
        self.v = np.zeros_like(params)

    def step(self, grads):
        self.t += 1

        # Update the first and second moments
        self.m = self.beta1 * self.m + (1 - self.beta1) * grads
        self.v = self.beta2 * self.v + (1 - self.beta2) * (grads ** 2)

        # Bias correction
        m_hat = self.m / (1 - self.beta1 ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)

        # Parameter update
        self.params = self.params - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

        return self.params

class AdaGrad(AdaptiveOptimizer):
    """AdaGrad optimizer"""
    def __init__(self, params, lr=0.01, eps=1e-8):
        super().__init__(params)
        self.lr = lr
        self.eps = eps
        self.G = np.zeros_like(params)

    def step(self, grads):
        self.G += grads ** 2
        self.params = self.params - self.lr * grads / (np.sqrt(self.G) + self.eps)
        return self.params

# Comparing the optimizers
np.random.seed(42)

# An ill-conditioned quadratic
Q = np.diag([1000, 100, 10, 1, 0.1])
b = np.ones(5)

def f(x):
    return 0.5 * x.T @ Q @ x - b.T @ x

def grad(x):
    return Q @ x - b

x0 = np.ones(5) * 10

optimizers = {
    'SGD': (lambda x: x - 0.001 * grad(x)),
    'AdaGrad': AdaGrad(x0.copy(), lr=0.1),
    'Adam': Adam(x0.copy(), lr=0.01)
}

results = {}
for name, opt in optimizers.items():
    if callable(opt):
        # plain SGD
        x = x0.copy()
        history = [f(x)]
        for _ in range(1000):
            x = opt(x)
            history.append(f(x))
    else:
        # adaptive methods
        x = opt.params.copy()
        history = [f(x)]
        for _ in range(1000):
            g = grad(x)
            x = opt.step(g)
            history.append(f(x))

    results[name] = history

# Plot convergence as the optimality gap (f can be negative, so plot f - f*)
import matplotlib.pyplot as plt

f_star = f(np.linalg.solve(Q, b))

plt.figure(figsize=(10, 6))
for name, history in results.items():
    plt.semilogy(np.array(history) - f_star, label=name)

plt.xlabel('Iteration')
plt.ylabel('Optimality gap f(x) - f* (log scale)')
plt.title('Optimizer Comparison on Ill-Conditioned Problem')
plt.legend()
plt.grid(True)
plt.show()

print(f"Optimal value: {f_star:.10f}")
for name, history in results.items():
    print(f"{name} final value: {history[-1]:.10f}")

3.4 Constrained Optimization

Projected Gradient Descent

Problem: \(\min_{\mathbf{x} \in C} f(\mathbf{x})\)

Algorithm:

\[\mathbf{x}_{t+1} = \Pi_C(\mathbf{x}_t - \eta \nabla f(\mathbf{x}_t))\]

where \(\Pi_C(\mathbf{x})\) is the projection of \(\mathbf{x}\) onto the set \(C\)

Common projections:
  • Nonnegative orthant: \(\Pi_{\mathbb{R}^+_n}(\mathbf{x}) = \max(\mathbf{x}, 0)\)
  • L2 ball: \(\Pi_{\|\mathbf{x}\|_2 \leq r}(\mathbf{x}) = r \frac{\mathbf{x}}{\max(\|\mathbf{x}\|_2, r)}\)
  • L1 ball: requires a dedicated algorithm (a sketch follows the code below)

Python
import numpy as np

def project_l2_ball(x, radius=1.0):
    """Project onto the L2 ball"""
    norm = np.linalg.norm(x)
    if norm <= radius:
        return x
    return radius * x / norm

def project_simplex(v, z=1.0):
    """
    Project onto the probability simplex: {x | x >= 0, sum(x) = z}
    Sort-based algorithm, O(n log n)
    """
    n = len(v)
    u = np.sort(v)[::-1]
    cssv = np.cumsum(u) - z
    ind = np.arange(n) + 1
    cond = u - cssv / ind > 0
    rho = ind[cond][-1]
    theta = cssv[cond][-1] / rho
    return np.maximum(v - theta, 0)

def projected_gradient_descent(grad_func, x0, projection_func,
                                n_iterations=1000, eta=0.01):
    """Projected gradient descent"""
    x = x0.copy()
    trajectory = []

    for t in range(n_iterations):
        grad = grad_func(x)
        x = x - eta * grad
        x = projection_func(x)
        trajectory.append(x.copy())

    return x, trajectory

# Example: optimizing over the probability simplex
np.random.seed(42)

# Objective: minimize ||x - target||^2 subject to x >= 0, sum(x) = 1
target = np.array([0.1, 0.3, 0.4, 0.2])

def grad(x):
    return 2 * (x - target)

x0 = np.ones(4) / 4  # start from the uniform distribution

x_opt, traj = projected_gradient_descent(
    grad, x0, lambda x: project_simplex(x, z=1.0),
    n_iterations=500, eta=0.1
)

print(f"Target distribution: {target}")
print(f"Optimization result: {x_opt}")
print(f"Constraint check: sum = {x_opt.sum():.6f}, min = {x_opt.min():.6f}")
print(f"Error: {np.linalg.norm(x_opt - target):.6f}")

The Proximal Gradient Method

Problem: \(\min_\mathbf{x} f(\mathbf{x}) + g(\mathbf{x})\)

where \(f\) is smooth and \(g\) is convex (possibly nonsmooth)

Algorithm:

\[\mathbf{x}_{t+1} = \text{prox}_{\eta g}(\mathbf{x}_t - \eta \nabla f(\mathbf{x}_t))\]

The proximal operator:

\[\text{prox}_{\lambda g}(\mathbf{x}) = \arg\min_\mathbf{u} \left\{ g(\mathbf{u}) + \frac{1}{2\lambda} \|\mathbf{u} - \mathbf{x}\|_2^2 \right\}\]

Common proximal operators:
  • L1 penalty (soft thresholding): \(\text{prox}_{\lambda \|\cdot\|_1}(x) = \text{sign}(x) \max(|x| - \lambda, 0)\)
  • L2 penalty: \(\text{prox}_{\lambda \|\cdot\|_2^2/2}(x) = \frac{x}{1+\lambda}\)

Python
import numpy as np

def soft_threshold(x, lambda_):
    """Soft-thresholding operator (the proximal operator of the L1 norm)"""
    return np.sign(x) * np.maximum(np.abs(x) - lambda_, 0)

def proximal_gradient_method(grad_f, prox_g, x0,
                             n_iterations=1000, eta=0.01):
    """
    Proximal gradient method

    Solves: min f(x) + g(x)
    """
    x = x0.copy()

    for t in range(n_iterations):
        # Gradient step
        y = x - eta * grad_f(x)

        # Proximal step
        x = prox_g(y, eta)

    return x

# Example: Lasso regression
# min ||Ax - b||^2 + lambda * ||x||_1

np.random.seed(42)
n, p = 100, 50
A = np.random.randn(n, p)
x_true = np.zeros(p)
x_true[:10] = np.random.randn(10)  # a sparse ground truth
b = A @ x_true + 0.1 * np.random.randn(n)

lambda_lasso = 0.1

def grad_f(x):
    """Gradient of f(x) = ||Ax - b||^2"""
    return 2 * A.T @ (A @ x - b)

def prox_g(x, eta):
    """Proximal operator of g(x) = lambda * ||x||_1"""
    return soft_threshold(x, eta * lambda_lasso)

x_lasso = proximal_gradient_method(grad_f, prox_g, np.zeros(p),
                                   n_iterations=2000, eta=0.001)

print(f"Nonzeros in the Lasso solution: {np.sum(x_lasso != 0)}")
print(f"Nonzeros in the ground truth:   {np.sum(x_true != 0)}")
print(f"Reconstruction error: {np.linalg.norm(A @ x_lasso - b):.6f}")

# Compare with ordinary least squares
x_ls = np.linalg.lstsq(A, b, rcond=None)[0]
print(f"Nonzeros in the least-squares solution: {np.sum(x_ls != 0)}")
print(f"Least-squares reconstruction error: {np.linalg.norm(A @ x_ls - b):.6f}")

IV. Information Theory

4.1 Entropy

Shannon Entropy

Definition:

The entropy of a discrete random variable \(X\):

\[H(X) = -\sum_{i} P(x_i) \log P(x_i) = E[-\log P(X)]\]

Intuition:
  • Entropy measures the "uncertainty", or information content, of a random variable
  • It is maximized by the uniform distribution
  • A deterministic outcome has zero entropy

Joint and conditional entropy (the chain rule is checked numerically after the code below):
  • \(H(X, Y) = -\sum_{x,y} P(x,y) \log P(x,y)\)
  • \(H(Y|X) = \sum_x P(x) H(Y|X=x) = H(X,Y) - H(X)\)

Python
import numpy as np

def shannon_entropy(p):
    """Shannon entropy in bits"""
    p = np.array(p)
    p = p[p > 0]  # drop zero-probability entries (avoids log(0))
    return -np.sum(p * np.log2(p))

# Example: entropies of different distributions
uniform = np.ones(8) / 8
biased = np.array([0.5, 0.25, 0.125, 0.0625, 0.03125, 0.015625, 0.0078125, 0.0078125])
deterministic = np.array([1, 0, 0, 0, 0, 0, 0, 0])

print("Entropy comparison (bits):")
print(f"Uniform: {shannon_entropy(uniform):.4f} (maximum: {np.log2(8):.4f})")
print(f"Skewed: {shannon_entropy(biased):.4f}")
print(f"Deterministic: {shannon_entropy(deterministic):.4f}")

# Entropy and code length
print("\nEntropy and coding:")
print(f"The uniform distribution needs on average {shannon_entropy(uniform):.2f} bits/symbol")
print(f"The skewed distribution needs on average {shannon_entropy(biased):.2f} bits/symbol")
print("Takeaway: non-uniformity in a distribution is what makes compression possible")

Differential Entropy

Definition: the entropy of a continuous random variable

\[h(X) = -\int f(x) \log f(x) dx\]

Caveats:
  • Differential entropy can be negative
  • It is not simply the limit of discrete entropy under finer discretization
  • The true information content depends on the quantization precision

Differential entropy of common distributions

Distribution | Differential entropy
Uniform \(U(a,b)\) | \(\log(b-a)\)
Normal \(\mathcal{N}(\mu, \sigma^2)\) | \(\frac{1}{2}\log(2\pi e \sigma^2)\)
Exponential \(Exp(\lambda)\) | \(1 - \log\lambda\)
Python
import numpy as np

def differential_entropy_gaussian(sigma2):
    """Differential entropy of a Gaussian: 0.5 * log(2*pi*e*sigma^2)"""
    return 0.5 * np.log(2 * np.pi * np.e * sigma2)

def estimate_differential_entropy(samples, bins='auto'):
    """Histogram estimate of differential entropy"""
    hist, bin_edges = np.histogram(samples, bins=bins, density=True)
    bin_widths = np.diff(bin_edges)

    # Only bins with nonzero density contribute
    mask = hist > 0
    entropy = -np.sum(hist[mask] * np.log(hist[mask]) * bin_widths[mask])

    return entropy

# Check
sigma2 = 4
samples = np.random.normal(0, np.sqrt(sigma2), 100000)

estimated = estimate_differential_entropy(samples)
theoretical = differential_entropy_gaussian(sigma2)

print(f"Differential entropy of N(0, {sigma2}):")
print(f"Theoretical: {theoretical:.4f}")
print(f"Estimated:   {estimated:.4f}")

4.2 Mutual Information and KL Divergence

Mutual Information

Definition:

\[I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X,Y)\]

Meaning: how much the uncertainty about \(X\) drops once \(Y\) is known

Properties:
  • \(I(X;Y) \geq 0\)
  • \(I(X;Y) = 0\) if and only if \(X\) and \(Y\) are independent
  • \(I(X;Y) = I(Y;X)\)

Compared with correlation:
  • Mutual information captures dependence of any form
  • Correlation captures only linear dependence

Python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def compute_mutual_information_discrete(x, y):
    """
    Mutual information between two discrete variables
    """
    # Joint distribution
    joint = np.zeros((len(np.unique(x)), len(np.unique(y))))
    for i, xi in enumerate(np.unique(x)):
        for j, yj in enumerate(np.unique(y)):
            joint[i, j] = np.mean((x == xi) & (y == yj))

    # Marginals
    px = joint.sum(axis=1)
    py = joint.sum(axis=0)

    # Mutual information
    mi = 0
    for i in range(len(px)):
        for j in range(len(py)):
            if joint[i, j] > 0:
                mi += joint[i, j] * np.log2(joint[i, j] / (px[i] * py[j]))

    return mi

# Example: different kinds of dependence
n = 1000

# 1. A linear relationship
x1 = np.random.randn(n)
y1 = 2 * x1 + 0.1 * np.random.randn(n)

# 2. A nonlinear relationship
x2 = np.random.randn(n)
y2 = x2**2 + 0.1 * np.random.randn(n)

# 3. Independence
x3 = np.random.randn(n)
y3 = np.random.randn(n)

from scipy.stats import pearsonr

def analyze_relationship(x, y, name):
    corr, _ = pearsonr(x, y)
    mi = mutual_info_regression(x.reshape(-1, 1), y)[0]
    print(f"{name}:")
    print(f"  correlation: {corr:.4f}")
    print(f"  mutual information: {mi:.4f}")
    print()

print("Mutual information vs. correlation:\n")
analyze_relationship(x1, y1, "Linear relationship")
analyze_relationship(x2, y2, "Nonlinear relationship (quadratic)")
analyze_relationship(x3, y3, "Independent")

KL Divergence (Relative Entropy)

Definition:

\[D_{KL}(P \| Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)} = E_P\left[\log \frac{P}{Q}\right]\]

Properties:
  • \(D_{KL}(P \| Q) \geq 0\) (Gibbs' inequality)
  • \(D_{KL}(P \| Q) = 0\) if and only if \(P = Q\)
  • Asymmetric: \(D_{KL}(P \| Q) \neq D_{KL}(Q \| P)\)

Variational representation:

\[D_{KL}(P \| Q) = \sup_f \left\{ E_P[f(X)] - \log E_Q[e^{f(X)}] \right\}\]

Applications:
  • Variational inference
  • Policy gradients (the REINFORCE algorithm)
  • Generative models (VAEs)

Python
import numpy as np

def kl_divergence_gaussian(mu1, sigma1, mu2, sigma2):
    """
    KL divergence between two Gaussians

    D_KL(N(mu1, sigma1^2) || N(mu2, sigma2^2))
    = log(sigma2/sigma1) + (sigma1^2 + (mu1-mu2)^2)/(2*sigma2^2) - 0.5
    """
    return (np.log(sigma2 / sigma1) +
            (sigma1**2 + (mu1 - mu2)**2) / (2 * sigma2**2) - 0.5)

def kl_divergence_discrete(p, q):
    """KL divergence between discrete distributions (assumes q > 0 wherever p > 0)"""
    p = np.array(p)
    q = np.array(q)
    # Only positions with p > 0 contribute
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

# Example 1: Gaussians
mu1, sigma1 = 0, 1
mu2, sigma2 = 1, 2

kl_forward = kl_divergence_gaussian(mu1, sigma1, mu2, sigma2)
kl_reverse = kl_divergence_gaussian(mu2, sigma2, mu1, sigma1)

print(f"KL divergence P||Q: {kl_forward:.4f}")
print(f"KL divergence Q||P: {kl_reverse:.4f}")
print(f"Asymmetry: {abs(kl_forward - kl_reverse):.4f}")

📚 References

Core Papers and Texts

Linear Algebra and Optimization

  1. Matrix Computations - Golub & Van Loan, 2013
     • The authoritative reference on matrix computation
  2. Convex Optimization - Boyd & Vandenberghe, 2004
     • The classic textbook on convex optimization
  3. The Elements of Statistical Learning - Hastie et al., 2009
     • A foundational text of statistical learning theory

Probability and Statistics

  1. Pattern Recognition and Machine Learning - Bishop, 2006
     • The classic text on Bayesian methods
  2. Information Theory, Inference, and Learning Algorithms - MacKay, 2003
     • Information theory meets machine learning
  3. Variational Inference: A Review for Statisticians - Blei et al., 2017
     • A survey of variational inference

Machine Learning Algorithms

  1. Nearest Neighbor Pattern Classification - Cover & Hart, 1967
     • Theoretical foundations of the k-NN rule
  2. Random Forests - Breiman, 2001
     • The random forest algorithm
  3. Regularization and Variable Selection via the Elastic Net - Zou & Hastie, 2005
     • Elastic Net regularization
  4. Greedy Function Approximation: A Gradient Boosting Machine - Friedman, 2001
     • Gradient boosting
  5. LightGBM: A Highly Efficient Gradient Boosting Decision Tree - Ke et al., 2017
     • The LightGBM algorithm
  6. XGBoost: A Scalable Tree Boosting System - Chen & Guestrin, 2016
     • The XGBoost algorithm

Deep Learning

  1. Deep Learning - Goodfellow et al., 2016
     • The "bible" of deep learning
  2. ImageNet Classification with Deep Convolutional Neural Networks - Krizhevsky et al., 2012
     • AlexNet, the start of the deep-learning revival
  3. Attention Is All You Need - Vaswani et al., 2017
     • The paper that introduced the Transformer

Technical Blogs

Chinese Blogs

💡 The following are recommended Bilibili creators; search for their names on Bilibili for the latest content.

  • 3Blue1Brown - math visualization (search "3Blue1Brown" on Bilibili)
  • 同济子豪兄 - ML math foundations (search "同济子豪兄" on Bilibili)
  • Datawhale - data-science math foundations (search "Datawhale" on Bilibili)

English Blogs

Open-Source Projects

Math and Scientific Computing

  • NumPy - the foundational library for scientific computing in Python
  • SciPy - scientific computing toolkit
  • SymPy - symbolic mathematics library
  • JAX - composable differentiable computing framework
  • CuPy - GPU-accelerated NumPy

Machine Learning

Visualization and Tools

Reference Books

Chinese-Language Books

  1. 《线性代数》 - Dept. of Mathematics, Tongji University (eds.), Higher Education Press
     • A classic Chinese linear-algebra textbook
  2. 《概率论与数理统计》 - Zhejiang University (eds.), Higher Education Press
     • A classic probability and statistics textbook
  3. 《高等数学》 - Dept. of Mathematics, Tongji University (eds.), Higher Education Press
     • A classic calculus textbook
  4. 《机器学习》 - Zhou Zhihua, Tsinghua University Press
     • The "watermelon book", a standard Chinese ML text
  5. 《统计学习方法》 - Li Hang, Tsinghua University Press
     • An authoritative text on the foundations of statistical learning
  6. 《深度学习》 - Ian Goodfellow, Yoshua Bengio, Aaron Courville (Chinese edition), Posts & Telecom Press
     • The "bible" of deep learning
  7. 《凸优化》 - Stephen Boyd, Lieven Vandenberghe (Chinese edition), Tsinghua University Press
     • The classic convex-optimization text
  8. 《信息论基础》 - Thomas M. Cover, Joy A. Thomas (Chinese edition), China Machine Press
     • The classic information-theory text

English-Language Books

  1. "Linear Algebra and Its Applications" - Gilbert Strang
     • Wellesley-Cambridge Press; a classic linear-algebra text
  2. "Introduction to Linear Algebra" - Gilbert Strang
     • Wellesley-Cambridge Press; an introductory linear-algebra text
  3. "Probability and Statistics" - Morris H. DeGroot, Mark J. Schervish
     • Addison-Wesley; a classic probability and statistics text
  4. "Convex Optimization" - Stephen Boyd, Lieven Vandenberghe
     • Cambridge University Press; the classic convex-optimization text
  5. "Pattern Recognition and Machine Learning" - Christopher M. Bishop
     • Springer; the classic text on Bayesian methods
  6. "The Elements of Statistical Learning" - Trevor Hastie, Robert Tibshirani, Jerome Friedman
     • Springer; a foundational text of statistical learning
  7. "Deep Learning" - Ian Goodfellow, Yoshua Bengio, Aaron Courville
     • MIT Press; the authoritative deep-learning text
  8. "Mathematics for Machine Learning" - Marc Peter Deisenroth, A. Aldo Faisal, Cheng Soon Ong
     • Cambridge University Press; mathematical foundations for ML

Online Courses

Chinese-Language Courses

💡 The following are Bilibili creators and search keywords; search directly on Bilibili for the latest content.

  • Andrew Ng's machine-learning course - search "吴恩达 机器学习" on Bilibili
  • 3Blue1Brown's math visualizations - search "3Blue1Brown 线性代数" on Bilibili
  • 同济子豪兄 - search "同济子豪兄 数学基础" on Bilibili
  • 李沐 (Mu Li) - search "李沐 机器学习" on Bilibili

English-Language Courses

Community Resources

Chinese Communities

English Communities

Forums and Q&A

Mailing Lists and Slack