LoRA 从零实现完整教程¶
⚠️ 时效性说明:本章涉及前沿模型/价格/榜单等信息,可能随版本快速变化;请以论文原文、官方发布页和 API 文档为准。
📌 本章定位:从零实现 LoRA ( Low-Rank Adaptation )的每一个组件,包含数学推导、完整代码和微调实战。对标 happy-LLM 的训练实践,我们不仅实现 LoRA 核心,还演示注入、训练、保存/加载和权重合并的完整工程流程。
🔗 配套理论: LoRA 原理与数学推导请参见 01-高效微调技术,应用层面请参见 LLM 应用/10-LoRA 与 QLoRA。
目录¶
| 节 | 内容 | 关键代码 |
|---|---|---|
| 1 | LoRA 数学原理回顾 | 低秩分解公式 |
| 2 | LoRA Layer 实现 | LoRALayer |
| 3 | 带 LoRA 的线性层 | LinearWithLoRA |
| 4 | LoRA 注入 | inject_lora() |
| 5 | 参数管理 | 冻结/解冻/统计 |
| 6 | 权重保存与加载 | save/load_lora_weights() |
| 7 | 权重合并 | merge_lora_weights() |
| 8 | 实战:微调 GPT-2 | 完整训练循环 |
| 9 | QLoRA 简介 | 4-bit 量化+LoRA |
1. LoRA 数学原理回顾¶
核心思想¶
预训练权重 \(W_0 \in \mathbb{R}^{d \times k}\) 在微调时保持不变,只训练一个低秩增量:

\[
W = W_0 + \Delta W = W_0 + \frac{\alpha}{r} B A
\]

其中:

- \(B \in \mathbb{R}^{d \times r}\),\(A \in \mathbb{R}^{r \times k}\)
- \(r \ll \min(d, k)\)(秩远小于原始维度)
- 可训练参数量:\(r \times (d + k)\),而全量微调为 \(d \times k\)
示例: GPT-2 的 \(W_Q\) 权重矩阵为 \(768 \times 768\),共 589,824 个参数。使用 \(r=8\) 的 LoRA 只需 \(8 \times (768 + 768) = 12{,}288\) 个参数,压缩 48 倍。
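上面的压缩比可以直接用几行 Python 验算(数值均取自本段示例):

```python
d, k = 768, 768            # GPT-2 中 W_Q 的维度
r = 8                      # LoRA 秩
full_params = d * k        # 全量微调需要的参数量
lora_params = r * (d + k)  # LoRA 可训练参数量
print(full_params, lora_params, full_params / lora_params)  # 589824 12288 48.0
```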
2. LoRA Layer 核心实现¶
import torch
import torch.nn as nn
import math
from typing import List  # 第4节 inject_lora 的类型标注需要
try:
from transformers.pytorch_utils import Conv1D
except Exception: # 教学示例:未安装 transformers 时允许单独阅读本段代码
Conv1D = None
class LoRALayer(nn.Module):
"""
LoRA核心层
实现低秩分解 ΔW = B @ A * (alpha / rank)
初始化策略(论文原文):
- A: Kaiming uniform(保证初始输出有合理方差)
- B: 全零(确保训练初始时 ΔW = 0,不改变预训练行为)
"""
def __init__(self, in_features: int, out_features: int,
rank: int = 8, lora_alpha: float = 16.0):
super().__init__() # super()调用父类方法
self.rank = rank
self.lora_alpha = lora_alpha
self.scaling = lora_alpha / rank # 缩放因子
# 低秩矩阵
# A: [in_features, rank] — 降维
self.lora_A = nn.Parameter(torch.empty(in_features, rank))
# B: [rank, out_features] — 升维
self.lora_B = nn.Parameter(torch.zeros(rank, out_features))
# A使用Kaiming初始化
nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
# B使用零初始化(关键!保证训练开始时ΔW=0)
def forward(self, x: torch.Tensor) -> torch.Tensor:
"""
Args:
x: [..., in_features] (支持任意前导维度)
Returns:
[..., out_features]
"""
# x @ A -> [..., rank] -> @ B -> [..., out_features]
return (x @ self.lora_A @ self.lora_B) * self.scaling
形状验证¶
lora = LoRALayer(in_features=768, out_features=768, rank=8, lora_alpha=16)
x = torch.randn(2, 10, 768) # [batch, seq_len, hidden]
delta = lora(x)
assert delta.shape == (2, 10, 768) # assert断言:条件False时抛出AssertionError
# 验证初始输出为零(因为B初始化为0)
assert torch.allclose(delta, torch.zeros_like(delta))
# 参数量检查
params = sum(p.numel() for p in lora.parameters())
assert params == 768 * 8 + 8 * 768 # = 12,288
print(f"✅ LoRALayer 验证通过 | 参数量: {params:,} (全量: {768*768:,}, 压缩率: {768*768/params:.1f}x)")
3. 带 LoRA 的线性层¶
class LinearWithLoRA(nn.Module):
"""
在现有nn.Linear上叠加LoRA
forward: y = W_0 @ x + b + (B @ A @ x) * scaling
"""
def __init__(self, linear: nn.Linear, rank: int = 8, lora_alpha: float = 16.0):
super().__init__()
self.linear = linear # 原始线性层
self.base_layer = linear
# 冻结原始参数
self.linear.weight.requires_grad = False
if self.linear.bias is not None:
self.linear.bias.requires_grad = False
# 添加LoRA
self.lora = LoRALayer(
in_features=linear.in_features,
out_features=linear.out_features,
rank=rank,
lora_alpha=lora_alpha
)
def forward(self, x: torch.Tensor) -> torch.Tensor:
"""
输出 = 原始线性层输出 + LoRA增量
"""
return self.linear(x) + self.lora(x)
@property # @property将方法变为属性访问
def weight(self):
"""兼容性:返回原始权重"""
return self.linear.weight
@property
def bias(self):
"""兼容性:返回原始偏置"""
return self.linear.bias
class Conv1DWithLoRA(nn.Module):
"""
兼容 Hugging Face GPT-2 系列常见的 Conv1D 投影层。
GPT-2 的 c_attn / c_proj 不是 nn.Linear,而是 transformers 的 Conv1D。
"""
def __init__(self, conv1d: "Conv1D", rank: int = 8, lora_alpha: float = 16.0):
if Conv1D is None:
raise ImportError("需要 transformers 才能使用 Conv1DWithLoRA")
super().__init__()
self.conv1d = conv1d
self.base_layer = conv1d
self.conv1d.weight.requires_grad = False
if self.conv1d.bias is not None:
self.conv1d.bias.requires_grad = False
in_features = conv1d.weight.shape[0]
out_features = conv1d.weight.shape[1]
self.lora = LoRALayer(
in_features=in_features,
out_features=out_features,
rank=rank,
lora_alpha=lora_alpha,
)
def forward(self, x: torch.Tensor) -> torch.Tensor:
return self.conv1d(x) + self.lora(x)
验证¶
# 创建一个普通线性层
original_linear = nn.Linear(768, 768)
# 包装为带LoRA的线性层
lora_linear = LinearWithLoRA(original_linear, rank=8, lora_alpha=16)
x = torch.randn(2, 10, 768)
# 训练前,LoRA输出为零,因此结果应与原始线性层一致
out_original = original_linear(x)
out_lora = lora_linear(x)
assert torch.allclose(out_original, out_lora, atol=1e-6)
print("✅ 训练前LoRA输出与原始一致(ΔW=0)")
# 验证梯度
assert not original_linear.weight.requires_grad, "原始权重应被冻结"
lora_params = [p for p in lora_linear.parameters() if p.requires_grad]
frozen_params = [p for p in lora_linear.parameters() if not p.requires_grad]
print(f"✅ 可训练参数: {sum(p.numel() for p in lora_params):,}")
print(f" 冻结参数: {sum(p.numel() for p in frozen_params):,}")
4. LoRA 注入¶
将 LoRA 注入到模型的指定层(如 Transformer 的 q_proj、v_proj):
def inject_lora(model: nn.Module, target_modules: List[str],
rank: int = 8, lora_alpha: float = 16.0) -> nn.Module:
"""
将LoRA注入到模型的指定模块
Args:
model: 预训练模型
target_modules: 目标模块名(如 ["q_proj", "v_proj", "query", "value"])
rank: LoRA秩
lora_alpha: 缩放参数
Returns:
注入LoRA后的模型
"""
replaced = []
for name, module in list(model.named_modules()):
# 检查该模块的子模块
for child_name, child in list(module.named_children()):
if not any(target in child_name for target in target_modules):
continue
if isinstance(child, nn.Linear):
wrapped = LinearWithLoRA(child, rank=rank, lora_alpha=lora_alpha)
elif Conv1D is not None and isinstance(child, Conv1D):
wrapped = Conv1DWithLoRA(child, rank=rank, lora_alpha=lora_alpha)
else:
continue
setattr(module, child_name, wrapped)
replaced.append(f"{name}.{child_name}" if name else child_name)
if replaced:
print(f"✅ LoRA注入完成 | 已替换 {len(replaced)} 个模块:")
for r in replaced:
print(f" - {r}")
else:
print("⚠️ 未找到匹配的模块,请检查target_modules参数")
return model
def freeze_base_model(model: nn.Module):
"""冻结所有非LoRA参数"""
for name, param in model.named_parameters():
if 'lora_' not in name:
param.requires_grad = False
def unfreeze_lora_parameters(model: nn.Module):
"""解冻所有LoRA参数"""
for name, param in model.named_parameters():
if 'lora_' in name:
param.requires_grad = True
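inject_lora() 的核心模式是"遍历 named_children() 并用 setattr 原地替换子模块"。下面用一个与具体 LoRA 实现无关的最小示意演示这一替换模式(其中 Wrapper 为演示用的假想包装层,对应正文中的 LinearWithLoRA):

```python
import torch
import torch.nn as nn

class Wrapper(nn.Module):
    """演示用包装层:仅转发到原层(真实场景中会在此叠加 LoRA 增量)"""
    def __init__(self, base: nn.Linear):
        super().__init__()
        self.base = base

    def forward(self, x):
        return self.base(x)

model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 2))

# 与 inject_lora 相同的替换模式:先快照 named_children,再 setattr 替换
for name, child in list(model.named_children()):
    if isinstance(child, nn.Linear):
        setattr(model, name, Wrapper(child))

print(type(model[0]).__name__, type(model[2]).__name__)  # Wrapper Wrapper
print(model(torch.randn(3, 8)).shape)                    # torch.Size([3, 2])
```

替换后前向行为不变,但每个被包装的层都多了一个可独立训练的挂载点,这正是第 4 节注入逻辑的骨架。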
5. 参数统计¶
def print_trainable_parameters(model: nn.Module) -> dict:
"""
统计并打印模型的可训练参数
Returns:
{"trainable": int, "total": int, "ratio": float}
"""
trainable = 0
total = 0
for name, param in model.named_parameters():
total += param.numel()
if param.requires_grad:
trainable += param.numel()
ratio = trainable / total * 100
print(f"可训练参数: {trainable:>12,}")
print(f"总参数: {total:>12,}")
print(f"可训练比例: {ratio:>11.4f}%")
return {"trainable": trainable, "total": total, "ratio": ratio}
6. 权重保存与加载¶
def save_lora_weights(model: nn.Module, save_path: str):
"""
只保存LoRA权重(不保存冻结的基础模型权重)
这让保存文件非常小(通常只有几MB)
"""
lora_state_dict = {}
for name, param in model.named_parameters():
if param.requires_grad and 'lora_' in name:
lora_state_dict[name] = param.data.clone()
torch.save(lora_state_dict, save_path)
total_size = sum(v.numel() * v.element_size() for v in lora_state_dict.values())
print(f"✅ LoRA权重已保存到 {save_path}")
print(f" 参数数量: {sum(v.numel() for v in lora_state_dict.values()):,}")
print(f" 文件大小: {total_size / 1024:.1f} KB")
def load_lora_weights(model: nn.Module, load_path: str):
"""
加载LoRA权重到已注入LoRA的模型
"""
lora_state_dict = torch.load(load_path, map_location='cpu', weights_only=True)
model_state = model.state_dict()
loaded = 0
for name, param in lora_state_dict.items():
if name in model_state:
if model_state[name].shape == param.shape:
model_state[name].copy_(param)
loaded += 1
else:
print(f"⚠️ 形状不匹配: {name} "
f"(模型: {model_state[name].shape}, 文件: {param.shape})")
else:
print(f"⚠️ 跳过不存在的参数: {name}")
print(f"✅ 成功加载 {loaded}/{len(lora_state_dict)} 个LoRA参数")
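save_lora_weights() 的关键只有一步:按参数名筛选出含 'lora_' 的条目再 torch.save。下面用一个假想的最小模型演示这一"筛选、保存、加载"的往返流程(lora_A / lora_B 仅为演示手工挂载,并非正文的 LoRALayer):

```python
import os
import tempfile

import torch
import torch.nn as nn

model = nn.Linear(4, 4)
# 手工挂载两个"LoRA 风格"参数,nn.Module 会自动注册到 named_parameters
model.lora_A = nn.Parameter(torch.randn(4, 2))
model.lora_B = nn.Parameter(torch.zeros(2, 4))

# 只保留名字含 'lora_' 的参数(weight / bias 被过滤掉)
lora_sd = {n: p.data.clone() for n, p in model.named_parameters() if 'lora_' in n}
path = os.path.join(tempfile.mkdtemp(), "lora.pt")
torch.save(lora_sd, path)

loaded = torch.load(path, map_location='cpu', weights_only=True)
print(sorted(loaded.keys()))  # ['lora_A', 'lora_B']
print(all(torch.equal(loaded[k], lora_sd[k]) for k in loaded))  # True
```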
7. 权重合并(推理部署用)¶
def merge_lora_weights(model: nn.Module) -> nn.Module:
"""
将LoRA权重合并到基础权重中
合并公式: W_merged = W_0 + ΔW * scaling
其中 ΔW = B @ A(论文中的 BA 分解)
合并后不再需要 LoRA 层,推理速度与原始模型一样快。
关于转置 (.T) 的说明:
─────────────────────────────────────────────────────────
nn.Linear.weight 的形状是 [out_features, in_features] = [d, k]
而 LoRA 中存储的 lora_A @ lora_B 的形状是 [k, r] @ [r, d] = [k, d]
两者形状不匹配,所以 Linear 合并时需要 .T 转置:[k, d] → [d, k]
Conv1D.weight 的形状是 [in_features, out_features] = [k, d]
与 lora_A @ lora_B 的 [k, d] 一致,所以 Conv1D 合并时不需要转置
─────────────────────────────────────────────────────────
"""
def replace_module(root: nn.Module, full_name: str, new_module: nn.Module):
if "." in full_name:
parent_name, child_name = full_name.rsplit(".", 1)
parent = root.get_submodule(parent_name)
else:
parent = root
child_name = full_name
setattr(parent, child_name, new_module)
merged_count = 0
for name, module in list(model.named_modules()):
if isinstance(module, LinearWithLoRA):
base = module.base_layer
            merged_linear = nn.Linear(
                base.in_features,
                base.out_features,
                bias=base.bias is not None,
                device=base.weight.device,  # 与基础权重保持同设备/同精度
                dtype=base.weight.dtype,
            )
# lora_A: [k, r], lora_B: [r, d] → A@B: [k, d] → .T: [d, k]
# 与 nn.Linear.weight 的 [out, in] = [d, k] 形状匹配
delta_weight = (module.lora.lora_A @ module.lora.lora_B).T * module.lora.scaling
merged_linear.weight.data.copy_(base.weight.data + delta_weight)
if base.bias is not None:
merged_linear.bias.data.copy_(base.bias.data)
replace_module(model, name, merged_linear)
merged_count += 1
elif Conv1D is not None and isinstance(module, Conv1DWithLoRA):
base = module.base_layer
            merged_conv1d = Conv1D(base.weight.shape[1], base.weight.shape[0]).to(
                base.weight.device, base.weight.dtype  # 与基础权重保持同设备/同精度
            )
# Conv1D.weight 形状为 [in, out] = [k, d],与 A@B 的 [k, d] 一致,无需转置
delta_weight = (module.lora.lora_A @ module.lora.lora_B) * module.lora.scaling
merged_conv1d.weight.data.copy_(base.weight.data + delta_weight)
merged_conv1d.bias.data.copy_(base.bias.data)
replace_module(model, name, merged_conv1d)
merged_count += 1
print(f"✅ 已合并并替换 {merged_count} 个LoRA层")
return model
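合并公式的正确性可以脱离具体模型,用随机小矩阵直接数值验证。下面的维度为演示随意选取,记号与正文一致:lora_A 形状 [in, rank],lora_B 形状 [rank, out],nn.Linear 权重形状 [out, in]:

```python
import torch

torch.manual_seed(0)
d_out, d_in, r, alpha = 6, 5, 2, 4.0
scaling = alpha / r

W0 = torch.randn(d_out, d_in)   # 基础权重,形状 [out, in]
A = torch.randn(d_in, r)        # 对应正文的 lora_A
B = torch.randn(r, d_out)       # 对应正文的 lora_B

x = torch.randn(3, d_in)
# 未合并:基础输出 + LoRA 增量
y_unmerged = x @ W0.T + (x @ A @ B) * scaling
# 合并:W_merged = W0 + (A @ B).T * scaling,之后只做一次矩阵乘法
W_merged = W0 + (A @ B).T * scaling
y_merged = x @ W_merged.T

print(torch.allclose(y_unmerged, y_merged, atol=1e-6))  # True
```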
8. 实战:微调 GPT-2¶
8.1 完整微调训练代码¶
"""
LoRA微调GPT-2的完整示例
目标: 用LoRA微调GPT-2,使其学会特定风格的文本生成
依赖: pip install "torch>=2.0" transformers datasets
"""
from transformers import GPT2LMHeadModel, GPT2Tokenizer
def lora_finetune_gpt2():
"""使用自实现的LoRA微调GPT-2"""
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"设备: {device}")
# Step 1: 加载预训练GPT-2
print("加载GPT-2模型...")
model = GPT2LMHeadModel.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token
# 打印原始参数量
total_before = sum(p.numel() for p in model.parameters())
print(f"原始参数量: {total_before:,}")
# Step 2: 注入LoRA
# GPT-2的注意力层命名为 c_attn(包含qkv投影)和 c_proj
# 但c_attn是一个合并的投影(768->2304),我们对c_attn和c_proj都注入
print("\n注入LoRA...")
model = inject_lora(
model,
target_modules=["c_attn", "c_proj"], # GPT-2的注意力投影层
rank=8,
lora_alpha=16
)
# 冻结基础模型
freeze_base_model(model)
unfreeze_lora_parameters(model)
print("\n参数统计:")
stats = print_trainable_parameters(model)
model = model.to(device) # .to(device)将数据移至GPU/CPU
# Step 3: 准备训练数据
# 使用一些示例文本来演示(实际应用中替换为你的数据集)
training_texts = [
"The art of programming is the art of organizing complexity.",
"Good code is its own best documentation.",
"First, solve the problem. Then, write the code.",
"Code is like humor. When you have to explain it, it's bad.",
"Programming is not about typing. It's about thinking.",
"The best error message is the one that never shows up.",
"Make it work, make it right, make it fast.",
"Clean code always looks like it was written by someone who cares.",
"Any fool can write code that a computer can understand.",
"Good programmers write code that humans can understand.",
] * 50 # 重复50次来模拟更多数据
# tokenize
encodings = tokenizer(
training_texts,
truncation=True,
max_length=64,
padding='max_length',
return_tensors='pt'
)
from torch.utils.data import TensorDataset, DataLoader
dataset = TensorDataset(encodings['input_ids'], encodings['attention_mask'])
dataloader = DataLoader(dataset, batch_size=8, shuffle=True)
# Step 4: 训练
optimizer = torch.optim.AdamW(
[p for p in model.parameters() if p.requires_grad],
lr=5e-4,
weight_decay=0.01
)
NUM_EPOCHS = 5
print(f"\n开始训练 ({NUM_EPOCHS} epochs)...")
for epoch in range(NUM_EPOCHS):
model.train()
total_loss = 0
for batch_ids, batch_mask in dataloader:
batch_ids = batch_ids.to(device)
batch_mask = batch_mask.to(device)
            # 语言模型的标签就是输入本身(右移由模型内部完成);
            # padding 位置置为 -100,使其不计入损失
            labels = batch_ids.masked_fill(batch_mask == 0, -100)
            outputs = model(
                input_ids=batch_ids,
                attention_mask=batch_mask,
                labels=labels
            )
loss = outputs.loss
optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(
[p for p in model.parameters() if p.requires_grad],
1.0
)
optimizer.step()
total_loss += loss.item()
avg_loss = total_loss / len(dataloader)
print(f"Epoch {epoch+1}/{NUM_EPOCHS} | Loss: {avg_loss:.4f}")
# Step 5: 保存LoRA权重
save_lora_weights(model, "gpt2_lora_weights.pt")
# Step 6: 生成测试
print("\n--- 生成测试 ---")
model.eval()
prompts = ["The art of", "Good code", "Programming is"]
for prompt in prompts:
input_ids = tokenizer.encode(prompt, return_tensors='pt').to(device)
with torch.no_grad(): # 禁用梯度计算,节省内存(推理时使用)
output = model.generate(
input_ids,
max_new_tokens=30,
temperature=0.7,
do_sample=True,
top_k=50,
pad_token_id=tokenizer.eos_token_id,
)
text = tokenizer.decode(output[0], skip_special_tokens=True)
print(f" '{prompt}' → {text}")
# Step 7 (可选): 合并权重用于高效推理
print("\n合并LoRA权重...")
model = merge_lora_weights(model)
print("合并完成! 现在模型可以不依赖LoRA层直接推理。")
if __name__ == "__main__":
lora_finetune_gpt2()
8.2 可能输出(示意)¶
设备: cuda
加载GPT-2模型...
原始参数量: 124,439,808
注入LoRA...
✅ LoRA注入完成 | 已替换若干个模块:
- transformer.h.0.attn.c_attn
- transformer.h.0.attn.c_proj
...
参数统计:
可训练参数: 294,912
总参数: 124,734,720
可训练比例: 0.2364%
开始训练 (5 epochs)...
Epoch 1/5 | Loss: 3.8234
Epoch 2/5 | Loss: 3.2145
Epoch 3/5 | Loss: 2.8901
Epoch 4/5 | Loss: 2.6432
Epoch 5/5 | Loss: 2.4567
✅ LoRA权重已保存到 gpt2_lora_weights.pt
参数数量: 294,912
文件大小: 1152.0 KB
--- 生成测试 ---
'The art of' → The art of programming is about understanding the problem...
'Good code' → Good code is its own best documentation...
合并LoRA权重...
✅ 已合并并替换若干个LoRA层
说明:实际替换的模块数量取决于 transformers 版本、模型结构以及 target_modules 的匹配规则,学习时不要把"24 个模块"这类输出当成硬编码结论。
9. QLoRA 简介¶
QLoRA 在 LoRA 基础上引入 4-bit 量化,进一步降低显存需求:
标准LoRA: 基础模型(FP16) + LoRA(FP16) → 7B模型 ≈ 16GB显存
QLoRA: 基础模型(NF4) + LoRA(BF16) → 7B模型 ≈ 6GB显存
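上面的显存数字可以粗略口算验证(只统计权重本身,不含激活、KV cache 与优化器状态,因此低于正文给出的整机估算):

```python
params = 7e9                      # 7B 参数
fp16_gb = params * 2 / 1024**3    # FP16: 每参数 2 字节
nf4_gb = params * 0.5 / 1024**3   # NF4: 每参数 4 bit = 0.5 字节
print(f"权重显存: FP16 ≈ {fp16_gb:.1f} GB, NF4 ≈ {nf4_gb:.1f} GB")
```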
QLoRA 的三项关键技术:
| 技术 | 作用 |
|---|---|
| NormalFloat4 (NF4) | 信息论最优的 4-bit 量化格式 |
| 双重量化 | 对量化常数再量化,65B 模型约可再省 3GB 显存 |
| 分页优化器 | GPU 显存不足时自动 page 到 CPU 内存 |
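为直观理解 4-bit 量化,下面给出一个极简的 absmax 均匀量化示意。注意这不是 NF4 的真实实现(NF4 使用针对正态分布权重的非均匀格点,且按块计算 scale),仅演示"量化、反量化"的基本往返与误差上界:

```python
import torch

torch.manual_seed(0)

def quantize_absmax_4bit(w: torch.Tensor):
    """简化版 absmax 4-bit 量化:映射到对称整数格点 [-7, 7]"""
    scale = w.abs().max() / 7
    q = torch.clamp(torch.round(w / scale), -7, 7).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(64)
q, s = quantize_absmax_4bit(w)
w_hat = dequantize(q, s)

print(q.unique().numel() <= 15)                        # 格点数不超过 15
print(bool((w - w_hat).abs().max() <= s / 2 + 1e-6))   # 误差不超过半个格距
```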
# QLoRA使用示例(使用bitsandbytes库,非从零实现)
# pip install bitsandbytes peft transformers
"""
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
# 4-bit量化配置
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True,
)
# 加载4-bit量化模型
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
quantization_config=bnb_config,
device_map="auto",
)
# 注入LoRA
lora_config = LoraConfig(
r=16,
lora_alpha=32,
target_modules=["q_proj", "v_proj"],
lora_dropout=0.05,
task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# 输出示例: trainable params 占比约 0.1%~0.3%(具体数值随模型与 PEFT 版本而变)
"""
📝 自测检查清单¶
- 能画出 LoRA 的 \(W_0 + BA\) 结构图
- 理解为什么 B 初始化为零(保证训练开始时不改变原模型行为)
- 能计算给定 rank 下的参数压缩比
- `inject_lora()` 能成功替换 GPT-2 的注意力投影层
- 训练后 LoRA 权重文件大小远小于原始模型
- 理解权重合并后推理不需要额外计算
- 能解释 LoRA 与全量微调在数学上的等价性(当 rank 足够大时)
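清单最后一条可以用 SVD 做数值验证:满秩时任意 \(\Delta W\) 都能精确写成 \(BA\),低秩截断则只能近似(维度为演示随意选取):

```python
import torch

torch.manual_seed(0)
d_out, d_in = 8, 6
dW = torch.randn(d_out, d_in)          # 任意目标增量矩阵

# 满秩分解:r = min(d_out, d_in) 时 BA 精确等于 dW
U, S, Vh = torch.linalg.svd(dW, full_matrices=False)
B = U * S                              # [d_out, r],每列乘以对应奇异值
A = Vh                                 # [r, d_in]
print(torch.allclose(B @ A, dW, atol=1e-5))  # True

# 低秩截断:r=2 时只能得到最优低秩近似,存在残差
r = 2
B2, A2 = U[:, :r] * S[:r], Vh[:r]
print(bool((B2 @ A2 - dW).norm() > 1e-3))    # True(有明显残差)
```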
💡 自测参考答案
**1. LoRA 结构图**

**2. B 初始化为零的原因**

B 初始化为零矩阵,使得训练开始时 $\Delta W = BA = 0$,即 LoRA 的贡献为零,模型输出完全等于原始预训练模型的输出 $h = W_0 x + 0 = W_0 x$。这保证了:

- 微调初始阶段不会破坏预训练模型的性能
- 训练从预训练模型的"良好起点"开始
- A 可以使用任意初始化(如 Kaiming),因为 B=0 保证了乘积为零

**3. 参数压缩比计算**

以 `d=4096, rank=16` 为例:

- 原始线性层参数:$d \times d = 4096 \times 4096 = 16{,}777{,}216$
- LoRA 参数:$d \times r + r \times d = 2 \times 4096 \times 16 = 131{,}072$
- 占比:$131{,}072 / 16{,}777{,}216 \approx 0.78\%$(仅增加不到 1% 的参数)

**4. inject_lora 替换 GPT-2 注意力层**

GPT-2 使用 `Conv1D` 而非 `nn.Linear`,因此需要 `Conv1DWithLoRA` 适配器。成功标志:`print_trainable_parameters()` 显示可训练参数占比 < 1%。

**5. LoRA 权重文件大小**

7B 模型的原始权重约 14GB(FP16),LoRA(rank=16,仅 q_proj+v_proj)权重仅约 20-50MB。

**6. 权重合并后推理零额外开销**

合并公式:$W_{merged} = W_0 + \frac{\alpha}{r} B A$。合并后模型与原始模型结构完全相同(都是 `nn.Linear`),推理时不再需要计算 $BAx$,因此:

- 推理延迟与原始模型完全一致
- 显存占用与原始模型相同(甚至略少,因为不需要存储 A 和 B)

**7. LoRA 与全量微调的数学等价性**

当 rank $r \geq \min(d_{in}, d_{out})$ 时,$BA$ 可以表示任意 $d_{out} \times d_{in}$ 矩阵(满秩),此时 $\Delta W = BA$ 的表达能力与直接优化 $W$ 完全等价。实践中 rank 远小于 $d$(如 rank=16 vs d=4096),所以 LoRA 是对 $\Delta W$ 做了低秩假设,牺牲部分表达能力换取参数效率。

📚 参考¶
- Hu et al., "LoRA: Low-Rank Adaptation of Large Language Models" (2021)
- Dettmers et al., "QLoRA: Efficient Finetuning of Quantized LLMs" (2023)
- PEFT Library: https://github.com/HuggingFace/peft
最后更新日期: 2026-03-26 适用版本: LLM 学习教程 v2026