Recommender System Evaluation

📖 Chapter Overview

Evaluation is a critical part of recommender system development: it tells us how well a model performs and guides algorithm improvements. This chapter covers evaluation methods, metrics, and their practical application.

🎯 Learning Objectives

- Understand why evaluation matters
- Master offline evaluation methods
- Master online evaluation methods
- Master A/B testing
- Be able to design an evaluation plan
13.1 Evaluation Overview

13.1.1 Levels of Evaluation

Offline evaluation:
- Evaluate on historical data
- Fast and cheap
- Does not fully reflect online performance

Online evaluation:
- Evaluate on real users
- Accurate but expensive
- Requires A/B testing
13.1.2 Evaluation Dimensions

| Dimension | Metrics | Description |
|---|---|---|
| Accuracy | Precision, Recall, AUC | How accurate the recommendations are |
| Coverage | Coverage | Fraction of the catalog that gets recommended |
| Diversity | Diversity | Variety within a recommendation list |
| Novelty | Novelty | Ability to recommend less-known items |
| Timeliness | Latency | System response time |
13.2 Offline Evaluation

13.2.1 Accuracy Metrics

Precision@K:

```python
def precision_at_k(y_true, y_pred, k):
    """
    Precision@K: fraction of the top-K recommendations that are relevant.
    y_true: ground-truth relevant items
    y_pred: ranked list of predicted items
    k: cutoff (top K)
    """
    top_k = y_pred[:k]
    hits = len(set(top_k) & set(y_true))
    return hits / k
```
Recall@K:

```python
def recall_at_k(y_true, y_pred, k):
    """
    Recall@K: fraction of the relevant items that appear in the top K.
    """
    top_k = y_pred[:k]
    hits = len(set(top_k) & set(y_true))
    return hits / len(y_true) if y_true else 0.0
```
F1@K:

```python
def f1_at_k(y_true, y_pred, k):
    """
    F1@K: harmonic mean of Precision@K and Recall@K.
    """
    precision = precision_at_k(y_true, y_pred, k)
    recall = recall_at_k(y_true, y_pred, k)
    if precision + recall == 0:
        return 0
    return 2 * precision * recall / (precision + recall)
```
MAP (Mean Average Precision):

```python
import numpy as np  # used throughout this chapter

def average_precision(y_true, y_pred):
    """
    Average Precision: precision averaged over the ranks of the hits.
    """
    hits = 0
    sum_precision = 0
    for i, item in enumerate(y_pred):  # enumerate yields (rank, item)
        if item in y_true:
            hits += 1
            sum_precision += hits / (i + 1)
    return sum_precision / len(y_true) if y_true else 0.0

def mean_average_precision(y_true_list, y_pred_list):
    """
    Mean Average Precision: AP averaged over users.
    """
    aps = [average_precision(y_true, y_pred)
           for y_true, y_pred in zip(y_true_list, y_pred_list)]  # zip pairs users
    return np.mean(aps)
```
NDCG (Normalized Discounted Cumulative Gain):

```python
def dcg_at_k(y_true, y_pred, k):
    """
    DCG@K with binary relevance: each hit at 0-based rank i
    contributes 1 / log2(i + 2).
    """
    dcg = 0
    for i, item in enumerate(y_pred[:k]):
        if item in y_true:
            dcg += 1 / np.log2(i + 2)
    return dcg

def ndcg_at_k(y_true, y_pred, k):
    """
    NDCG@K: DCG normalized by the DCG of an ideal ranking.
    """
    dcg = dcg_at_k(y_true, y_pred, k)
    # Ideal DCG: all relevant items ranked first
    ideal_pred = y_true[:k]
    idcg = dcg_at_k(y_true, ideal_pred, k)
    if idcg == 0:
        return 0
    return dcg / idcg
```
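To make the formula concrete, here is NDCG@3 worked by hand for one toy ranking (the items and relevance labels are made up for illustration):

```python
import numpy as np

# Relevant items {A, B}; predicted ranking [C, A, B]
y_true = ['A', 'B']
y_pred = ['C', 'A', 'B']

# DCG@3: hits at 0-based positions 1 and 2 contribute 1/log2(i + 2)
dcg = 1 / np.log2(1 + 2) + 1 / np.log2(2 + 2)
# Ideal ranking puts both relevant items first (positions 0 and 1)
idcg = 1 / np.log2(0 + 2) + 1 / np.log2(1 + 2)
ndcg = dcg / idcg
print(round(ndcg, 4))  # 0.6934
```

Because the first relevant item only appears at rank 2 instead of rank 1, NDCG falls well below 1.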
13.2.2 Coverage Metrics

Item coverage:

```python
def item_coverage(all_items, recommended_items):
    """
    Item coverage: fraction of the catalog that was ever recommended.
    """
    return len(set(recommended_items)) / len(set(all_items))
```

User coverage:

```python
def user_coverage(all_users, users_with_recommendations):
    """
    User coverage: fraction of users who received recommendations.
    """
    return len(users_with_recommendations) / len(all_users)
```
13.2.3 Diversity Metrics

Intra-list diversity:

```python
def intra_list_diversity(recommendations, item_features):
    """
    Intra-list diversity: 1 minus the average pairwise similarity
    between the recommended items (Jaccard similarity over feature sets).
    """
    if len(recommendations) <= 1:
        return 0
    total_similarity = 0
    count = 0
    for i in range(len(recommendations)):
        for j in range(i + 1, len(recommendations)):
            feat1 = set(item_features.get(recommendations[i], []))
            feat2 = set(item_features.get(recommendations[j], []))
            union = feat1 | feat2
            if not union:  # skip pairs with no features at all
                continue
            total_similarity += len(feat1 & feat2) / len(union)
            count += 1
    if count == 0:
        return 0
    return 1 - total_similarity / count
```
13.2.4 Novelty Metrics

Popularity-based novelty:

```python
def novelty(recommendations, item_popularity):
    """
    Novelty based on popularity: less popular items are more novel.
    """
    if not recommendations or not item_popularity:
        return 0
    avg_popularity = np.mean([
        item_popularity.get(item, 0)
        for item in recommendations
    ])
    # Normalize by the most popular item
    max_popularity = max(item_popularity.values())
    return 1 - avg_popularity / max_popularity
```
13.3 Online Evaluation

13.3.1 User Behavior Metrics

CTR (Click-Through Rate):

```python
def calculate_ctr(clicks, impressions):
    """
    CTR = clicks / impressions
    """
    if impressions == 0:
        return 0
    return clicks / impressions
```

CVR (Conversion Rate):

```python
def calculate_cvr(conversions, clicks):
    """
    CVR = conversions / clicks
    """
    if clicks == 0:
        return 0
    return conversions / clicks
```

Dwell time:

```python
def avg_dwell_time(dwell_times):
    """
    Average dwell time per item or session
    """
    if not dwell_times:
        return 0
    return np.mean(dwell_times)
```
13.3.2 Business Metrics

GMV (Gross Merchandise Value):
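GMV is simply the total transaction value attributed to recommendations. A minimal sketch, assuming an orders table with `price` and `quantity` columns (the column names are illustrative, not a fixed schema):

```python
import pandas as pd

def gmv(orders: pd.DataFrame) -> float:
    """Total GMV: sum of price * quantity over all attributed orders."""
    return float((orders['price'] * orders['quantity']).sum())

orders = pd.DataFrame({
    'price': [10.0, 25.0, 5.0],
    'quantity': [2, 1, 4],
})
print(gmv(orders))  # 10*2 + 25*1 + 5*4 = 65.0
```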
Retention rate:

```python
def retention_rate(active_users, total_users):
    """
    Retention rate: share of users still active after a given period
    """
    if total_users == 0:
        return 0
    return active_users / total_users
```
13.4 A/B Testing

13.4.1 Test Design

Basic workflow:
1. Traffic splitting: randomly assign users to a test group and a control group
2. Test group: serve the new algorithm
3. Control group: serve the old algorithm
4. Data collection: log the key metrics
5. Statistical analysis: decide whether the observed difference is significant
13.4.2 流量分流¶
import hashlib
def assign_group(user_id, test_ratio=0.5):
"""
分配实验组
user_id: 用户ID
test_ratio: 实验组比例
"""
# 使用哈希确保一致性
hash_value = int(hashlib.md5(str(user_id).encode()).hexdigest(), 16)
if hash_value % 100 < test_ratio * 100:
return 'test'
else:
return 'control'
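A quick sanity check for the splitter: assign many synthetic user IDs and verify that the split is deterministic and close to the target ratio (this sketch inlines the same hash-bucket logic as `assign_group` above so it runs standalone):

```python
import hashlib

def assign_group(user_id, test_ratio=0.5):
    # Same hash-bucket logic as above: stable per user
    hash_value = int(hashlib.md5(str(user_id).encode()).hexdigest(), 16)
    return 'test' if hash_value % 100 < test_ratio * 100 else 'control'

groups = [assign_group(uid, test_ratio=0.3) for uid in range(10_000)]
test_share = groups.count('test') / len(groups)
print(round(test_share, 2))  # typically close to 0.30

# The same user always lands in the same group
assert assign_group(42) == assign_group(42)
```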
13.4.3 Statistical Testing

t-test:

```python
import numpy as np
from scipy import stats

def t_test(control_values, test_values, alpha=0.05):
    """
    Two-sample t-test on a per-user metric.
    """
    control_mean = np.mean(control_values)
    test_mean = np.mean(test_values)
    # Two-sample t-test (pass equal_var=False for Welch's variant
    # when the group variances may differ)
    t_stat, p_value = stats.ttest_ind(test_values, control_values)
    # Significant at level alpha?
    is_significant = p_value < alpha
    return {
        'control_mean': control_mean,
        'test_mean': test_mean,
        'lift': (test_mean - control_mean) / control_mean,
        't_stat': t_stat,
        'p_value': p_value,
        'is_significant': is_significant
    }
```
Chi-square test:

```python
def chi_square_test(observed, expected):
    """
    Chi-square test
    observed: observed counts
    expected: expected counts
    """
    # Chi-square statistic
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # p-value
    # Note: dof = len(observed) - 1 applies to a goodness-of-fit test.
    # For a 2x2 contingency table in an A/B test, use
    # dof = (rows - 1) * (cols - 1) = 1.
    dof = len(observed) - 1
    p_value = 1 - stats.chi2.cdf(chi2, dof)
    return {
        'chi2': chi2,
        'p_value': p_value,
        'is_significant': p_value < 0.05
    }
```
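Before running a test, it also helps to know how many users each group needs. Below is a sketch of the standard normal-approximation sample-size formula for comparing two proportions (the function and parameter names are my own):

```python
import math
from scipy import stats

def min_sample_size(p_base, min_relative_lift, alpha=0.05, power=0.8):
    """Per-group sample size needed to detect a relative lift in a rate.

    Standard normal-approximation formula for two proportions.
    """
    p1 = p_base
    p2 = p_base * (1 + min_relative_lift)
    z_alpha = stats.norm.ppf(1 - alpha / 2)  # two-sided test
    z_beta = stats.norm.ppf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2
    return math.ceil(n)

# Detecting a 10% relative lift from a 5% baseline CTR
n = min_sample_size(p_base=0.05, min_relative_lift=0.10)
print(n)
```

At a 5% baseline CTR, a 10% relative lift already requires on the order of 30,000 users per group, which is why small-traffic experiments must run for days.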
13.5 Hands-On Cases

Case: Offline Evaluation

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# 1. Load the data
data = pd.read_csv('recommendation_data.csv')

# 2. Split into train and test sets
train, test = train_test_split(data, test_size=0.2, random_state=42)

# 3. Train the model (train_model stands in for your own training routine)
model = train_model(train)

# 4. Evaluate
def evaluate_model(model, test):
    """Evaluate the model with the metrics from Section 13.2"""
    precisions = []
    recalls = []
    ndcgs = []
    for user_id, group in test.groupby('user_id'):
        # Ground-truth items for this user
        y_true = group['item_id'].tolist()
        # Model predictions
        y_pred = model.recommend(user_id, n=10)
        # Per-user metrics
        precisions.append(precision_at_k(y_true, y_pred, k=10))
        recalls.append(recall_at_k(y_true, y_pred, k=10))
        ndcgs.append(ndcg_at_k(y_true, y_pred, k=10))
    # Averages over users
    return {
        'precision@10': np.mean(precisions),
        'recall@10': np.mean(recalls),
        'ndcg@10': np.mean(ndcgs)
    }

# 5. Run the evaluation
results = evaluate_model(model, test)
print("Offline evaluation results:")
for metric, value in results.items():
    print(f"{metric}: {value:.4f}")
```
Case: A/B Test

```python
import pandas as pd
import numpy as np
from scipy import stats

# 1. Load the A/B test data
ab_data = pd.read_csv('ab_test_data.csv')

# 2. Compute per-group metrics
control_data = ab_data[ab_data['group'] == 'control']
test_data = ab_data[ab_data['group'] == 'test']
control_ctr = control_data['click'].mean()
test_ctr = test_data['click'].mean()
print(f"Control CTR: {control_ctr:.4f}")
print(f"Test CTR: {test_ctr:.4f}")
print(f"Lift: {(test_ctr - control_ctr) / control_ctr:.2%}")

# 3. Statistical test (t_test is defined in Section 13.4.3)
control_clicks = control_data['click'].values
test_clicks = test_data['click'].values
result = t_test(control_clicks, test_clicks)
print("\nStatistical test results:")
print(f"t-statistic: {result['t_stat']:.4f}")
print(f"p-value: {result['p_value']:.4f}")
print(f"Significant: {result['is_significant']}")
```
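Because clicks are 0/1 outcomes, a two-proportion z-test is a natural alternative to the t-test used above. A minimal sketch with synthetic counts (pooled-variance form; the helper name is my own):

```python
import numpy as np
from scipy import stats

def two_proportion_ztest(clicks_a, n_a, clicks_b, n_b):
    """Two-sided z-test for the difference between two click rates."""
    p_a = clicks_a / n_a
    p_b = clicks_b / n_b
    # Pooled rate under the null hypothesis of equal CTRs
    p_pool = (clicks_a + clicks_b) / (n_a + n_b)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - stats.norm.cdf(abs(z)))
    return z, p_value

# Synthetic example: control 500/10000 clicks, test 600/10000 clicks
z, p = two_proportion_ztest(500, 10_000, 600, 10_000)
print(round(z, 3), round(p, 4))
```

Here the 5% → 6% CTR difference is highly significant given 10,000 impressions per group.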
📝 Chapter Summary

This chapter covered recommender system evaluation:

- ✅ Why evaluation matters and its levels
- ✅ Offline evaluation methods
- ✅ Online evaluation methods
- ✅ A/B testing
- ✅ Hands-on case studies

After this chapter you should be able to:
- Understand why evaluation matters
- Compute the main evaluation metrics
- Design an A/B test
- Analyze evaluation results
- Improve a recommender system based on the findings

🔗 Next Steps

The next chapter covers large-scale recommender systems and how to build recommenders that scale.

Continue with: 14-大规模推荐系统.md
💡 Review Questions

1. What are the pros and cons of offline vs. online evaluation?

   Offline: fast, repeatable, and cheap, but it can diverge substantially from online results (position bias, feedback loops, new products). Online (A/B testing): the most accurate and directly reflects business value, but slow, needs enough traffic, and may hurt user experience. In practice: screen quickly offline (promote to A/B only when AUC improves) → small-traffic A/B → full rollout.

2. How do you choose the right evaluation metrics?

   Match the scenario: ① CTR prediction → AUC / GAUC / LogLoss ② ranking quality → NDCG / MAP ③ retrieval → Recall@K / HitRate@K ④ business level → CTR / CVR / GMV / retention / time spent ⑤ user experience → diversity / novelty / coverage. Rule of thumb: use AUC/NDCG for fast offline screening, but let business metrics decide online.

3. How should an A/B test be designed to be valid?

   ① Random splitting (user-level hash buckets so the groups are unbiased) ② sufficient sample size (use power analysis to determine the minimum) ③ run long enough (usually ≥ 7 days to cover weekend effects) ④ significance testing (p-value < 0.05) ⑤ avoid repeated peeking (checking results early inflates false positives) ⑥ keep parallel experiments mutually exclusive (so they do not interfere). A layered experiment framework supports many experiments in parallel.

4. How do you balance accuracy and diversity?

   Optimizing purely for accuracy leads to filter bubbles (only the user's single favorite category gets recommended). Ways to balance: ① add diversity constraints at re-ranking (MMR / DPP) ② blend a diversity score at serving time (score = α×relevance + β×diversity) ③ reserve a traffic quota (a few slots for exploratory content) ④ monitor diversity metrics (entropy / intra-list similarity) and set a floor on them.

5. How do you evaluate a recommender's long-term effects?

   Short-term metrics (CTR / CVR) do not capture long-term value. Long-term signals: ① retention curves (7-day / 30-day) ② changes in the breadth of user interests ③ creator-ecosystem health ④ user LTV (lifetime value) ⑤ ecosystem diversity (Gini coefficient). Methods: long-running A/B tests (keep a long-term hold-out group) plus counterfactual analysis.
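The MMR re-ranking mentioned in question 4 can be sketched as a greedy loop (a minimal illustration; the toy items, scores, and similarity function are made up):

```python
def mmr_rerank(candidates, relevance, similarity, lam=0.7, k=5):
    """Maximal Marginal Relevance: greedily pick the item that best
    trades off relevance against similarity to items already picked.
    lam = 1.0 is pure relevance; lam = 0.0 is pure diversity.
    """
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def mmr_score(item):
            max_sim = max((similarity(item, s) for s in selected), default=0.0)
            return lam * relevance[item] - (1 - lam) * max_sim
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy example: same-category items count as fully similar
cats = {'a': 'news', 'b': 'news', 'c': 'sports', 'd': 'news'}
rel = {'a': 0.9, 'b': 0.85, 'c': 0.6, 'd': 0.5}
sim = lambda x, y: 1.0 if cats[x] == cats[y] else 0.0
print(mmr_rerank(list(rel), rel, sim, lam=0.5, k=3))  # ['a', 'c', 'b']
```

With lam=0.5 the sports item `c` jumps ahead of the second news item `b` even though `b` is more relevant, which is exactly the accuracy-diversity trade-off being made.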
📚 References

- "Evaluating Recommendation Systems" - Shani & Gunawardana
- "Recommender Systems Handbook" - Chapter 8
- "Trustworthy Online Controlled Experiments" - Kohavi et al.
- "Statistical Significance in A/B Testing" - Deng
- Scikit-learn documentation
