A Complete Tutorial: Implementing LoRA from Scratch¶
⚠️ Freshness note: this chapter touches on cutting-edge models, pricing, leaderboards, and similar information that can change quickly between versions; always defer to the original papers, official release pages, and API documentation.
📌 Scope of this chapter: implement every component of LoRA (Low-Rank Adaptation) from scratch, with mathematical derivations, complete code, and a hands-on fine-tuning example. Mirroring the training practice in happy-llm, we implement not only the LoRA core but the full engineering workflow: injection, training, saving/loading, and weight merging.
🔗 Companion theory: for LoRA's principles and mathematical derivation see 01-Efficient Fine-Tuning Techniques; for the application side see LLM Applications/10-LoRA and QLoRA.
Table of Contents¶
| Section | Topic | Key code |
|---|---|---|
| 1 | LoRA math refresher | low-rank decomposition formula |
| 2 | LoRA layer implementation | LoRALayer |
| 3 | Linear layer with LoRA | LinearWithLoRA |
| 4 | LoRA injection | inject_lora() |
| 5 | Parameter management | freeze/unfreeze/statistics |
| 6 | Saving and loading weights | save/load_lora_weights() |
| 7 | Weight merging | merge_lora_weights() |
| 8 | Hands-on: fine-tuning GPT-2 | full training loop |
| 9 | A brief look at QLoRA | 4-bit quantization + LoRA |
1. LoRA Math Refresher¶
Core idea¶
The pretrained weight \(W_0 \in \mathbb{R}^{d \times k}\) stays frozen during fine-tuning; we train only a low-rank increment:
\[ W = W_0 + \Delta W = W_0 + BA \]
where:
- \(B \in \mathbb{R}^{d \times r}\), \(A \in \mathbb{R}^{r \times k}\)
- \(r \ll \min(d, k)\) (the rank is far smaller than the original dimensions)
- trainable parameters: \(r \times (d + k)\), versus \(d \times k\) for full fine-tuning
Example: GPT-2's W_Q has \(768 \times 768 = 589{,}824\) parameters. With LoRA at \(r=8\), only \(8 \times (768 + 768) = 12{,}288\) parameters are trained, a 48x reduction.
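The parameter arithmetic above is easy to check directly; a quick sketch (the dimensions match the GPT-2 example):

```python
# Parameter count: full fine-tuning vs. LoRA for a d x k weight matrix
d, k, r = 768, 768, 8

full_params = d * k          # every entry of W is trained
lora_params = r * (d + k)    # B has d*r entries, A has r*k entries

print(full_params)                 # 589824
print(lora_params)                 # 12288
print(full_params // lora_params)  # 48
```

The reduction factor \(dk / (r(d+k))\) grows as the rank shrinks, which is why small ranks (4-16) are common in practice.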
2. The Core LoRA Layer¶
Python
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
from typing import List, Optional

class LoRALayer(nn.Module):
    """
    Core LoRA layer.
    Implements the low-rank update ΔW = B @ A * (alpha / rank).
    Initialization strategy (as in the paper):
    - A: Kaiming uniform (gives the initial output a reasonable variance)
    - B: all zeros (ensures ΔW = 0 at the start of training, leaving the
      pretrained behavior unchanged)
    """
    def __init__(self, in_features: int, out_features: int,
                 rank: int = 8, lora_alpha: float = 16.0):
        super().__init__()  # super() calls the parent-class initializer
        self.rank = rank
        self.lora_alpha = lora_alpha
        self.scaling = lora_alpha / rank  # scaling factor

        # Low-rank matrices
        # A: [in_features, rank] (down-projection)
        self.lora_A = nn.Parameter(torch.empty(in_features, rank))
        # B: [rank, out_features] (up-projection)
        self.lora_B = nn.Parameter(torch.zeros(rank, out_features))
        # A uses Kaiming initialization
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        # B keeps its zero initialization (crucial: guarantees ΔW = 0 at the start of training)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Args:
            x: [..., in_features] (any number of leading dimensions)
        Returns:
            [..., out_features]
        """
        # x @ A -> [..., rank] -> @ B -> [..., out_features]
        return (x @ self.lora_A @ self.lora_B) * self.scaling
Shape checks¶
Python
lora = LoRALayer(in_features=768, out_features=768, rank=8, lora_alpha=16)
x = torch.randn(2, 10, 768)  # [batch, seq_len, hidden]
delta = lora(x)
assert delta.shape == (2, 10, 768)  # assert raises AssertionError when the condition is False

# The initial output must be zero (because B is initialized to 0)
assert torch.allclose(delta, torch.zeros_like(delta))

# Parameter-count check
params = sum(p.numel() for p in lora.parameters())
assert params == 768 * 8 + 8 * 768  # = 12,288
print(f"✅ LoRALayer checks passed | params: {params:,} (full: {768*768:,}, reduction: {768*768/params:.1f}x)")
3. A Linear Layer with LoRA¶
Python
class LinearWithLoRA(nn.Module):
    """
    Adds a LoRA branch on top of an existing nn.Linear.
    forward: y = x @ W_0ᵀ + b + (x @ A @ B) * scaling
    """
    def __init__(self, linear: nn.Linear, rank: int = 8, lora_alpha: float = 16.0):
        super().__init__()
        self.linear = linear  # the original linear layer
        # Freeze the original parameters
        self.linear.weight.requires_grad = False
        if self.linear.bias is not None:
            self.linear.bias.requires_grad = False
        # Attach LoRA
        self.lora = LoRALayer(
            in_features=linear.in_features,
            out_features=linear.out_features,
            rank=rank,
            lora_alpha=lora_alpha
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        output = original linear output + LoRA increment
        """
        return self.linear(x) + self.lora(x)

    @property  # @property exposes a method as attribute access
    def weight(self):
        """Compatibility: return the original weight"""
        return self.linear.weight

    @property
    def bias(self):
        """Compatibility: return the original bias"""
        return self.linear.bias
Checks¶
Python
# Create an ordinary linear layer
original_linear = nn.Linear(768, 768)
# Wrap it with LoRA
lora_linear = LinearWithLoRA(original_linear, rank=8, lora_alpha=16)
x = torch.randn(2, 10, 768)

# Before training the LoRA branch outputs zero, so the result must match the original layer
out_original = original_linear(x)
out_lora = lora_linear(x)
assert torch.allclose(out_original, out_lora, atol=1e-6)
print("✅ Before training, LoRA output matches the original (ΔW = 0)")

# Check gradient flags
assert original_linear.weight.requires_grad == False, "the original weight should be frozen"
lora_params = [p for p in lora_linear.parameters() if p.requires_grad]
frozen_params = [p for p in lora_linear.parameters() if not p.requires_grad]
print(f"✅ Trainable params: {sum(p.numel() for p in lora_params):,}")
print(f"   Frozen params:    {sum(p.numel() for p in frozen_params):,}")
4. LoRA Injection¶
Inject LoRA into selected layers of a model (e.g. a Transformer's q_proj and v_proj). Note that the attention projections in GPT-2 are transformers `Conv1D` modules rather than `nn.Linear`, so the injector converts those to an equivalent `nn.Linear` first:
Python
def inject_lora(model: nn.Module, target_modules: List[str],
                rank: int = 8, lora_alpha: float = 16.0) -> nn.Module:
    """
    Inject LoRA into the named modules of a model.
    Args:
        model: pretrained model
        target_modules: target module names (e.g. ["q_proj", "v_proj", "query", "value"])
        rank: LoRA rank
        lora_alpha: scaling parameter
    Returns:
        the model with LoRA injected
    """
    replaced = []
    for name, module in model.named_modules():
        # Inspect this module's direct children
        for child_name, child in module.named_children():
            # Does the name match any target? any() returns True if at least one does
            if not any(target in child_name for target in target_modules):
                continue
            if isinstance(child, nn.Linear):  # isinstance checks the type
                linear = child
            elif type(child).__name__ == 'Conv1D':
                # transformers' GPT-2 uses Conv1D, which stores its weight as
                # [in_features, out_features]; convert it to an equivalent nn.Linear
                in_f, out_f = child.weight.shape
                linear = nn.Linear(in_f, out_f)
                linear.weight.data = child.weight.data.T.contiguous()
                linear.bias.data = child.bias.data.clone()
            else:
                continue
            # Replace with the LoRA-augmented version
            lora_module = LinearWithLoRA(linear, rank=rank, lora_alpha=lora_alpha)
            setattr(module, child_name, lora_module)  # setattr sets an attribute dynamically
            replaced.append(f"{name}.{child_name}" if name else child_name)
    if replaced:
        print(f"✅ LoRA injection complete | replaced {len(replaced)} modules:")
        for r in replaced:
            print(f"   - {r}")
    else:
        print("⚠️ No matching modules found; check the target_modules argument")
    return model

def freeze_base_model(model: nn.Module):
    """Freeze all non-LoRA parameters"""
    for name, param in model.named_parameters():
        if 'lora_' not in name:
            param.requires_grad = False

def unfreeze_lora_parameters(model: nn.Module):
    """Unfreeze all LoRA parameters"""
    for name, param in model.named_parameters():
        if 'lora_' in name:
            param.requires_grad = True
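The targeting rule is a substring match on each child's name. A small sketch on a toy attention block (the module names are illustrative) shows which modules get picked up:

```python
import torch.nn as nn

# A toy attention block with the projection names commonly used in LLMs
class ToyAttention(nn.Module):
    def __init__(self, d: int = 32):
        super().__init__()
        self.q_proj = nn.Linear(d, d)
        self.k_proj = nn.Linear(d, d)
        self.v_proj = nn.Linear(d, d)
        self.out_proj = nn.Linear(d, d)

model = ToyAttention()
target_modules = ["q_proj", "v_proj"]

# The same rule inject_lora() applies: substring match on the child's name
matched = [name for name, child in model.named_children()
           if isinstance(child, nn.Linear)
           and any(t in name for t in target_modules)]
print(matched)  # ['q_proj', 'v_proj']
```

Because this is substring matching, broad targets can over-match: `"proj"` alone would hit all four layers above, and in GPT-2 `"c_proj"` matches both the attention output projection and the MLP output projection.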
5. Parameter Statistics¶
Python
def print_trainable_parameters(model: nn.Module) -> dict:
    """
    Count and print the model's trainable parameters.
    Returns:
        {"trainable": int, "total": int, "ratio": float}
    """
    trainable = 0
    total = 0
    for name, param in model.named_parameters():
        total += param.numel()
        if param.requires_grad:
            trainable += param.numel()
    ratio = trainable / total * 100
    print(f"Trainable params: {trainable:>12,}")
    print(f"Total params:     {total:>12,}")
    print(f"Trainable ratio:  {ratio:>11.4f}%")
    return {"trainable": trainable, "total": total, "ratio": ratio}
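The counting logic can be sanity-checked on a toy model (the layer sizes here are arbitrary):

```python
import torch.nn as nn

# Two linear layers; freeze the first one to mimic a frozen base model
model = nn.Sequential(nn.Linear(10, 20), nn.Linear(20, 5))
for p in model[0].parameters():
    p.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())

print(trainable)  # 105 (20*5 weights + 5 biases of the second layer)
print(total)      # 325 (10*20 + 20 frozen, plus the 105 above)
```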
6. Saving and Loading Weights¶
Python
def save_lora_weights(model: nn.Module, save_path: str):
    """
    Save only the LoRA weights (not the frozen base-model weights).
    This keeps the saved file very small (typically a few MB).
    """
    lora_state_dict = {}
    for name, param in model.named_parameters():
        if param.requires_grad and 'lora_' in name:
            lora_state_dict[name] = param.data.clone()
    torch.save(lora_state_dict, save_path)
    total_size = sum(v.numel() * v.element_size() for v in lora_state_dict.values())
    print(f"✅ LoRA weights saved to {save_path}")
    print(f"   Parameter count: {sum(v.numel() for v in lora_state_dict.values()):,}")
    print(f"   File size: {total_size / 1024:.1f} KB")

def load_lora_weights(model: nn.Module, load_path: str):
    """
    Load LoRA weights into a model that already has LoRA injected.
    """
    lora_state_dict = torch.load(load_path, map_location='cpu', weights_only=True)
    model_state = model.state_dict()
    loaded = 0
    for name, param in lora_state_dict.items():
        if name in model_state:
            if model_state[name].shape == param.shape:
                model_state[name].copy_(param)
                loaded += 1
            else:
                print(f"⚠️ Shape mismatch: {name} "
                      f"(model: {model_state[name].shape}, file: {param.shape})")
        else:
            print(f"⚠️ Skipping unknown parameter: {name}")
    print(f"✅ Loaded {loaded}/{len(lora_state_dict)} LoRA parameters")
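A minimal round trip of this save/load pattern, using a hand-built stand-in for a LoRA-only state dict (the tensor names and shapes are illustrative):

```python
import os
import tempfile
import torch

# Stand-in for a LoRA-only state dict: just the adapter tensors
state = {
    "h.0.attn.lora.lora_A": torch.randn(768, 8),
    "h.0.attn.lora.lora_B": torch.zeros(8, 768),
}

path = os.path.join(tempfile.mkdtemp(), "demo_lora.pt")
torch.save(state, path)
loaded = torch.load(path, map_location="cpu", weights_only=True)

assert set(loaded) == set(state)
assert torch.equal(loaded["h.0.attn.lora.lora_A"], state["h.0.attn.lora.lora_A"])

# 12,288 float32 parameters -> 48 KB of tensor data
size_kb = sum(v.numel() * v.element_size() for v in loaded.values()) / 1024
print(size_kb)  # 48.0
```

`weights_only=True` restricts deserialization to plain tensors, which is exactly what an adapter checkpoint contains and avoids executing arbitrary pickled code.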
7. Weight Merging (for inference deployment)¶
Python
def merge_lora_weights(model: nn.Module) -> nn.Module:
    """
    Merge the LoRA weights into the base weights.
    After merging: W_merged = W_0 + ΔW * scaling
    The LoRA layers are then no longer needed, and inference runs as fast as the original model.
    """
    merged_count = 0
    for name, module in model.named_modules():
        if isinstance(module, LinearWithLoRA):
            # Grab the base weight and the LoRA factors
            W_0 = module.linear.weight.data   # [out, in]
            lora_A = module.lora.lora_A.data  # [in, rank]
            lora_B = module.lora.lora_B.data  # [rank, out]
            scaling = module.lora.scaling
            # Shape assertions: make sure the LoRA matrices are compatible
            assert lora_A.shape[0] == W_0.shape[1], \
                f"LoRA A rows ({lora_A.shape[0]}) != weight in_features ({W_0.shape[1]})"
            assert lora_B.shape[1] == W_0.shape[0], \
                f"LoRA B cols ({lora_B.shape[1]}) != weight out_features ({W_0.shape[0]})"
            assert lora_A.shape[1] == lora_B.shape[0], \
                f"LoRA rank mismatch: A cols ({lora_A.shape[1]}) != B rows ({lora_B.shape[0]})"
            # Merge: nn.Linear stores its weight as [out_features, in_features],
            # and LoRA computes x @ A @ B, so the increment in weight space is
            # (A @ B)ᵀ = Bᵀ @ Aᵀ
            delta_W = (lora_B.T @ lora_A.T) * scaling  # [out, in]
            module.linear.weight.data = W_0 + delta_W
            # Zero out lora_B after merging so forward doesn't add the delta twice
            module.lora.lora_B.data.zero_()
            merged_count += 1
    print(f"✅ Merged the weights of {merged_count} LoRA layers")
    return model
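The transpose bookkeeping in the merge is worth verifying numerically. A self-contained check with random matrices (dimensions chosen arbitrarily) confirms that the merged weight reproduces the two-branch forward pass:

```python
import torch

torch.manual_seed(0)
d_out, d_in, r = 16, 12, 4
scaling = 16.0 / r                 # alpha / rank

W0 = torch.randn(d_out, d_in)      # nn.Linear stores its weight as [out, in]
A = torch.randn(d_in, r)           # down-projection, as in LoRALayer
B = torch.randn(r, d_out)          # up-projection
x = torch.randn(3, d_in)

# Unmerged path: base output + scaled LoRA branch
y_unmerged = x @ W0.T + (x @ A @ B) * scaling

# Merged path: fold the increment into the weight, ΔW = Bᵀ @ Aᵀ * scaling
W_merged = W0 + (B.T @ A.T) * scaling
y_merged = x @ W_merged.T

print(torch.allclose(y_unmerged, y_merged, atol=1e-5))  # True
```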
8. Hands-On: Fine-Tuning GPT-2¶
8.1 Complete fine-tuning script¶
Python
"""
A complete example of fine-tuning GPT-2 with LoRA.
Goal: teach GPT-2 a particular style of text generation.
Dependencies: pip install torch transformers datasets
"""
from transformers import GPT2LMHeadModel, GPT2Tokenizer

def lora_finetune_gpt2():
    """Fine-tune GPT-2 with our own LoRA implementation"""
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    print(f"Device: {device}")

    # Step 1: load pretrained GPT-2
    print("Loading GPT-2...")
    model = GPT2LMHeadModel.from_pretrained('gpt2')
    tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
    tokenizer.pad_token = tokenizer.eos_token

    # Print the original parameter count
    total_before = sum(p.numel() for p in model.parameters())
    print(f"Original parameter count: {total_before:,}")

    # Step 2: inject LoRA
    # GPT-2 names its attention layers c_attn (the fused qkv projection,
    # 768 -> 2304) and c_proj. These are transformers Conv1D modules
    # (weight stored as [in, out]), not nn.Linear, so the injection code
    # must handle Conv1D. Note that "c_proj" also matches each block's
    # MLP output projection, so three modules per block get replaced.
    print("\nInjecting LoRA...")
    model = inject_lora(
        model,
        target_modules=["c_attn", "c_proj"],  # GPT-2's projection layers
        rank=8,
        lora_alpha=16
    )
    # Freeze the base model
    freeze_base_model(model)
    unfreeze_lora_parameters(model)
    print("\nParameter statistics:")
    stats = print_trainable_parameters(model)
    model = model.to(device)  # .to(device) moves the model to GPU/CPU

    # Step 3: prepare training data
    # A few sample sentences for demonstration (swap in your own dataset in practice)
    training_texts = [
        "The art of programming is the art of organizing complexity.",
        "Good code is its own best documentation.",
        "First, solve the problem. Then, write the code.",
        "Code is like humor. When you have to explain it, it's bad.",
        "Programming is not about typing. It's about thinking.",
        "The best error message is the one that never shows up.",
        "Make it work, make it right, make it fast.",
        "Clean code always looks like it was written by someone who cares.",
        "Any fool can write code that a computer can understand.",
        "Good programmers write code that humans can understand.",
    ] * 50  # repeat 50x to simulate a larger dataset

    # tokenize
    encodings = tokenizer(
        training_texts,
        truncation=True,
        max_length=64,
        padding='max_length',
        return_tensors='pt'
    )
    from torch.utils.data import TensorDataset, DataLoader
    dataset = TensorDataset(encodings['input_ids'], encodings['attention_mask'])
    dataloader = DataLoader(dataset, batch_size=8, shuffle=True)

    # Step 4: train
    optimizer = torch.optim.AdamW(
        [p for p in model.parameters() if p.requires_grad],
        lr=5e-4,
        weight_decay=0.01
    )
    NUM_EPOCHS = 5
    print(f"\nStarting training ({NUM_EPOCHS} epochs)...")
    for epoch in range(NUM_EPOCHS):
        model.train()
        total_loss = 0
        for batch_ids, batch_mask in dataloader:
            batch_ids = batch_ids.to(device)
            batch_mask = batch_mask.to(device)
            # For language modeling the labels are the inputs themselves;
            # set padding positions to -100 so they are excluded from the loss
            labels = batch_ids.masked_fill(batch_mask == 0, -100)
            outputs = model(
                input_ids=batch_ids,
                attention_mask=batch_mask,
                labels=labels
            )
            loss = outputs.loss
            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(
                [p for p in model.parameters() if p.requires_grad],
                1.0
            )
            optimizer.step()
            total_loss += loss.item()
        avg_loss = total_loss / len(dataloader)
        print(f"Epoch {epoch+1}/{NUM_EPOCHS} | Loss: {avg_loss:.4f}")

    # Step 5: save the LoRA weights
    save_lora_weights(model, "gpt2_lora_weights.pt")

    # Step 6: generation test
    print("\n--- Generation test ---")
    model.eval()
    prompts = ["The art of", "Good code", "Programming is"]
    for prompt in prompts:
        input_ids = tokenizer.encode(prompt, return_tensors='pt').to(device)
        with torch.no_grad():  # disable gradient tracking to save memory at inference
            output = model.generate(
                input_ids,
                max_new_tokens=30,
                temperature=0.7,
                do_sample=True,
                top_k=50,
                pad_token_id=tokenizer.eos_token_id,
            )
        text = tokenizer.decode(output[0], skip_special_tokens=True)
        print(f"  '{prompt}' → {text}")

    # Step 7 (optional): merge the weights for efficient inference
    print("\nMerging LoRA weights...")
    model = merge_lora_weights(model)
    print("Done! The model can now run inference without the LoRA layers.")

if __name__ == "__main__":
    lora_finetune_gpt2()
8.2 Expected output¶
With 12 blocks and three matched projections per block (attn.c_attn, attn.c_proj, mlp.c_proj), 36 modules are replaced, for 811,008 trainable LoRA parameters. The loss values below are illustrative:
Text Only
Device: cuda
Loading GPT-2...
Original parameter count: 124,439,808

Injecting LoRA...
✅ LoRA injection complete | replaced 36 modules:
   - transformer.h.0.attn.c_attn
   - transformer.h.0.attn.c_proj
   - transformer.h.0.mlp.c_proj
   ...

Parameter statistics:
Trainable params:      811,008
Total params:      125,250,816
Trainable ratio:       0.6475%

Starting training (5 epochs)...
Epoch 1/5 | Loss: 3.8234
Epoch 2/5 | Loss: 3.2145
Epoch 3/5 | Loss: 2.8901
Epoch 4/5 | Loss: 2.6432
Epoch 5/5 | Loss: 2.4567

✅ LoRA weights saved to gpt2_lora_weights.pt
   Parameter count: 811,008
   File size: 3168.0 KB

--- Generation test ---
  'The art of' → The art of programming is about understanding the problem...
  'Good code' → Good code is its own best documentation...

Merging LoRA weights...
✅ Merged the weights of 36 LoRA layers
9. A Brief Look at QLoRA¶
QLoRA adds 4-bit quantization on top of LoRA, cutting memory requirements further:
Text Only
Standard LoRA: base model (FP16) + LoRA (FP16) → 7B model ≈ 16 GB VRAM
QLoRA:         base model (NF4)  + LoRA (BF16) → 7B model ≈ 6 GB VRAM
QLoRA's three key techniques:
| Technique | Purpose |
|---|---|
| NormalFloat4 (NF4) | information-theoretically optimal 4-bit quantization format |
| Double quantization | quantizes the quantization constants themselves, saving ~0.4 GB on a 65B model |
| Paged optimizers | pages optimizer state to CPU RAM when GPU memory runs out |
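To build intuition for what 4-bit quantization does, here is a simplified blockwise absmax scheme. This is an illustration only: real NF4 uses a fixed codebook derived from the normal distribution rather than a uniform grid, and bitsandbytes packs two 4-bit values per byte.

```python
import torch

def quantize_absmax_4bit(w: torch.Tensor, block_size: int = 64):
    """Blockwise absmax quantization to the signed 4-bit range [-8, 7]."""
    blocks = w.flatten().view(-1, block_size)
    # One quantization constant per block; these per-block scales are what
    # double quantization would compress further
    scales = blocks.abs().max(dim=1, keepdim=True).values
    q = torch.clamp((blocks / scales * 7).round(), -8, 7).to(torch.int8)
    return q, scales

def dequantize(q: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    return (q.float() / 7 * scales).flatten()

torch.manual_seed(0)
w = torch.randn(4096)
q, scales = quantize_absmax_4bit(w)
w_hat = dequantize(q, scales)

# Weights now cost 4 bits each plus one scale per block, at the price of
# a small reconstruction error
print((w - w_hat).abs().mean())
```

During QLoRA training, the frozen base weights stay in this compressed form and are dequantized on the fly for each matmul, while the LoRA factors remain in 16-bit precision.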
Python
# QLoRA usage example (via the bitsandbytes library, not implemented from scratch)
# pip install bitsandbytes peft transformers
"""
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Load the model quantized to 4 bits
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)

# Inject LoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Reports the trainable parameter count and ratio (well under 1% of all parameters)
"""
📝 Self-Check List¶
- Can draw the \(W_0 + BA\) structure of LoRA
- Understand why B is initialized to zero (so training starts without changing the base model's behavior)
- Can compute the parameter-reduction ratio for a given rank
- inject_lora() successfully replaces GPT-2's attention projection layers
- The trained LoRA weight file is far smaller than the original model
- Understand why merged weights add no extra compute at inference
- Can explain the mathematical equivalence of LoRA and full fine-tuning (when the rank is large enough)
📚 References¶
- Hu et al., "LoRA: Low-Rank Adaptation of Large Language Models" (2021)
- Dettmers et al., "QLoRA: Efficient Finetuning of Quantized LLMs" (2023)
- PEFT Library: https://github.com/huggingface/peft