08 - Feature Engineering¶
🎯 Why Is Feature Engineering So Important?¶
Text Only
Garbage In, Garbage Out
Good features + a simple model > bad features + a complex model
The payoff of feature engineering:
- A large share of modeling time (often cited as 80%) goes into feature engineering
- Good features can push a simple model to SOTA performance
- It is a key differentiator between data scientists
📊 Feature Types¶
Numerical Features¶
Categorical Features¶
Temporal Features¶
Text Features¶
Geospatial Features¶
🔧 Numerical Feature Handling¶
1. Missing Value Handling¶
Identifying missing values:
Python
import pandas as pd
import numpy as np
# Inspect missing values
df.isnull().sum()
df.isnull().mean()  # fraction of missing values
# Visualize
import missingno as msno
msno.matrix(df)
msno.bar(df)
Handling strategies:
Deletion¶
Python
# Drop rows with too many missing values (assignment instead of inplace=True)
df = df.dropna(thresh=len(df.columns) * 0.7)
# Drop columns with too many missing values (assignment instead of inplace=True)
df = df.dropna(thresh=len(df) * 0.7, axis=1)
Imputation¶
Python
# Mean imputation (normally distributed data)
df['column'] = df['column'].fillna(df['column'].mean())
# Median imputation (data with outliers)
df['column'] = df['column'].fillna(df['column'].median())
# Mode imputation (categorical features)
df['column'] = df['column'].fillna(df['column'].mode()[0])
# Forward fill (time-series data)
df['column'] = df['column'].ffill()
# Interpolation
df['column'] = df['column'].interpolate(method='linear')
# Missing-value indicator (preserves the missingness information)
df['column_missing'] = df['column'].isnull().astype(int)
df['column'] = df['column'].fillna(-999)
Advanced Methods¶
Python
# KNN imputation
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=5)
df_filled = imputer.fit_transform(df)
# Iterative imputation (MICE)
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
imputer = IterativeImputer()
df_filled = imputer.fit_transform(df)
2. Outlier Detection and Handling¶
Detection methods:
Box Plot (IQR Method)¶
Python
def detect_outliers_iqr(series):
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return (series < lower_bound) | (series > upper_bound)
outliers = detect_outliers_iqr(df['column'])
print(f"Number of outliers: {outliers.sum()}")
Z-Score Method¶
Python
from scipy import stats
z_scores = np.abs(stats.zscore(df['column']))
outliers = z_scores > 3  # more than 3 standard deviations
Visualization¶
Python
import matplotlib.pyplot as plt
# Box plot
plt.boxplot(df['column'])
plt.show()
# Histogram
plt.hist(df['column'], bins=50)
plt.show()
# Scatter plot (multivariate)
plt.scatter(df['feature1'], df['feature2'])
plt.show()
Handling strategies:
Python
# Removal
df = df[~outliers]
# Capping (winsorization)
upper_limit = df['column'].quantile(0.99)
lower_limit = df['column'].quantile(0.01)
df['column'] = df['column'].clip(lower_limit, upper_limit)
# Log transform (skewed data)
df['column'] = np.log1p(df['column'])  # log(1+x)
# Box-Cox transform
from scipy import stats
transformed, lambda_param = stats.boxcox(df['column'] + 1)
3. Feature Scaling¶
Why scale?
- Faster gradient-descent convergence
- Distance-based algorithms (KNN, K-Means, SVM)
- Neural networks
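To make the motivation concrete, here is a minimal sketch (with made-up numbers, not from the source) showing how one unscaled feature can dominate a Euclidean distance:

```python
import numpy as np

# Two hypothetical samples: (age, income) -- income is on a much larger scale
a = np.array([25.0, 50_000.0])
b = np.array([60.0, 51_000.0])

# Raw distance is dominated almost entirely by the income gap (1000 vs. 35)
raw_dist = np.linalg.norm(a - b)

# After standardizing each column over a toy dataset,
# the age difference contributes meaningfully again
data = np.array([[25.0, 50_000.0], [60.0, 51_000.0], [40.0, 80_000.0]])
scaled = (data - data.mean(axis=0)) / data.std(axis=0)
scaled_dist = np.linalg.norm(scaled[0] - scaled[1])

print(raw_dist, scaled_dist)
```

Any distance-based model (KNN, K-Means, SVM) sees the raw pair as "almost identical in age terms", which is exactly what scaling fixes.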
Standardization (Z-Score Normalization)¶
Python
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Result: mean = 0, standard deviation = 1
# Suitable for: approximately normally distributed data
Min-Max Scaling¶
Python
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))
X_scaled = scaler.fit_transform(X)
# Result: values in [0, 1]
# Suitable for: neural networks, image pixels
Robust Scaling (RobustScaler)¶
Python
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
X_scaled = scaler.fit_transform(X)
# Uses the median and interquartile range, so it is robust to outliers
# Suitable for: data containing outliers
Max-Absolute Scaling (MaxAbsScaler)¶
Python
from sklearn.preprocessing import MaxAbsScaler
scaler = MaxAbsScaler()
X_scaled = scaler.fit_transform(X)
# Scales to [-1, 1] while preserving sparsity
# Suitable for: sparse data
Important: scale the test set with the scaler fitted on the training set!
Python
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # note: transform only, never fit on test data!
4. Feature Transformations¶
Log Transform¶
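This subsection has no example in the source; a minimal sketch of a log transform on a hypothetical right-skewed feature:

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed feature (e.g. income with one extreme value)
df = pd.DataFrame({'income': [20_000, 30_000, 45_000, 60_000, 1_000_000]})

# log1p computes log(1 + x), so zeros are handled safely
df['income_log'] = np.log1p(df['income'])

# Skewness shrinks after the transform
print(df['income'].skew(), df['income_log'].skew())
```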
Exponential Transform¶
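Likewise empty in the source; a sketch of the opposite direction, stretching the upper tail of a hypothetical left-skewed feature (the data and column names are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical left-skewed feature (most values near the top of the scale)
df = pd.DataFrame({'score': [1.0, 7.0, 8.0, 8.5, 9.0, 9.5]})

# Exponentiating or squaring stretches the upper tail and compresses
# the lower one -- the mirror image of the log transform
df['score_exp'] = np.exp(df['score'])
df['score_sq'] = df['score'] ** 2

print(df['score'].skew(), df['score_sq'].skew())
```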
Box-Cox Transform (Automatic λ Selection)¶
Python
from scipy import stats
# Only valid for strictly positive values
transformed, lambda_param = stats.boxcox(df['feature'])
print(f"Optimal λ: {lambda_param:.2f}")
# λ = 1: identity
# λ = 0: log transform
# λ = -1: reciprocal
Yeo-Johnson Transform (Handles Negative Values)¶
Python
from sklearn.preprocessing import PowerTransformer
pt = PowerTransformer(method='yeo-johnson')
X_transformed = pt.fit_transform(X)
🏷️ Categorical Feature Handling¶
1. Label Encoding¶
Python
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['city_encoded'] = le.fit_transform(df['city'])
# Result: 0, 1, 2, 3, ...
# Suitable for: ordinal categories (low/medium/high)
# Problem: imposes an order (meaningless for, e.g., city names)
2. One-Hot Encoding¶
Python
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(sparse_output=False, drop='first')  # drop one level to avoid multicollinearity
encoded = encoder.fit_transform(df[['city']])
# Or with pandas
df = pd.get_dummies(df, columns=['city'], drop_first=True)
# Result: city_Beijing, city_Shanghai, city_Guangzhou, ...
# Suitable for: nominal categories with low cardinality (< 10)
3. Target Encoding¶
Python
# Replace each category with the mean of the target variable
mean_target = df.groupby('category')['target'].mean()
df['category_encoded'] = df['category'].map(mean_target)
# Suitable for: high-cardinality categories (e.g. 1000 cities)
# Caution: prone to overfitting (use smoothing or cross-validation)
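The smoothing mentioned above can be sketched as follows; the toy data and the smoothing strength `m` are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'category': ['a', 'a', 'a', 'b', 'b', 'c'],
    'target':   [1,   1,   0,   0,   0,   1],
})

global_mean = df['target'].mean()
m = 5  # hypothetical smoothing strength: larger m pulls rare categories toward the global mean

cat_stats = df.groupby('category')['target'].agg(['mean', 'count'])
# Weighted blend of the per-category mean and the global mean
smoothed = (cat_stats['count'] * cat_stats['mean'] + m * global_mean) / (cat_stats['count'] + m)
df['category_te'] = df['category'].map(smoothed)

# The rare category 'c' (one sample, target = 1) is pulled toward the global mean
print(smoothed.round(3))
```

For stricter leakage control, compute the encoding inside cross-validation folds rather than on the full training set.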
4. Frequency Encoding¶
Python
# Use the relative frequency of each category
frequency = df['category'].value_counts(normalize=True)
df['category_freq'] = df['category'].map(frequency)
5. Embedding Encoding¶
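This subsection has no example in the source; a minimal NumPy sketch of the idea. In practice the embedding matrix is learned jointly with a neural network (e.g. via `torch.nn.Embedding`); the categories and dimensions here are made up:

```python
import numpy as np

# Hypothetical vocabulary of categories
categories = ['beijing', 'shanghai', 'guangzhou', 'shenzhen']
cat_to_idx = {c: i for i, c in enumerate(categories)}

embedding_dim = 3  # assumed; a common heuristic is min(50, cardinality // 2)
rng = np.random.default_rng(42)
# Randomly initialized lookup table; training would adjust these vectors
embeddings = rng.normal(size=(len(categories), embedding_dim))

def encode(values):
    # Each category is replaced by its dense low-dimensional vector
    return embeddings[[cat_to_idx[v] for v in values]]

X = encode(['beijing', 'shenzhen', 'beijing'])
print(X.shape)  # (3, 3)
```

Unlike one-hot encoding, the representation stays compact for high-cardinality categories, and learned vectors can place similar categories close together.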
🕐 Temporal Feature Engineering¶
Python
import pandas as pd
# Convert to datetime
df['date'] = pd.to_datetime(df['date'])
# Basic features
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['dayofweek'] = df['date'].dt.dayofweek
df['hour'] = df['date'].dt.hour
df['minute'] = df['date'].dt.minute
# Higher-level features
df['quarter'] = df['date'].dt.quarter
df['weekofyear'] = df['date'].dt.isocalendar().week
df['dayofyear'] = df['date'].dt.dayofyear
# Indicator features
df['is_weekend'] = df['dayofweek'].isin([5, 6]).astype(int)
df['is_month_start'] = df['date'].dt.is_month_start.astype(int)
df['is_month_end'] = df['date'].dt.is_month_end.astype(int)
df['is_quarter_start'] = df['date'].dt.is_quarter_start.astype(int)
# Cyclical encoding (so that December isn't "greater than" January)
def encode_cyclical(feature, max_val):
    return (
        np.sin(2 * np.pi * feature / max_val),
        np.cos(2 * np.pi * feature / max_val)
    )
df['month_sin'], df['month_cos'] = encode_cyclical(df['month'], 12)
df['dayofweek_sin'], df['dayofweek_cos'] = encode_cyclical(df['dayofweek'], 7)
# Time-difference features
df['days_since_start'] = (df['date'] - df['date'].min()).dt.days
df['days_to_end'] = (df['date'].max() - df['date']).dt.days
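To verify that the cyclical encoding behaves as claimed, a quick standalone check (same sine/cosine logic as in the block above):

```python
import numpy as np

def encode_cyclical(feature, max_val):
    # Map a cyclic value onto the unit circle
    return (np.sin(2 * np.pi * feature / max_val),
            np.cos(2 * np.pi * feature / max_val))

def dist(m1, m2, max_val=12):
    # Euclidean distance between two encoded months
    x1, y1 = encode_cyclical(m1, max_val)
    x2, y2 = encode_cyclical(m2, max_val)
    return float(np.hypot(x1 - x2, y1 - y2))

# December (12) and January (1) end up adjacent -- unlike in the raw encoding,
# where their distance would be |12 - 1| = 11
print(dist(12, 1), dist(6, 1))
```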
📝 Text Feature Engineering¶
1. Basic Statistical Features¶
Python
df['text_length'] = df['text'].str.len()
df['word_count'] = df['text'].str.split().str.len()
df['avg_word_length'] = df['text'].str.len() / df['word_count']
df['uppercase_count'] = df['text'].str.count(r'[A-Z]')
df['exclamation_count'] = df['text'].str.count(r'!')
df['question_count'] = df['text'].str.count(r'\?')
2. TF-IDF¶
Python
from sklearn.feature_extraction.text import TfidfVectorizer
# Word level
vectorizer = TfidfVectorizer(
    max_features=5000,
    stop_words='english',
    ngram_range=(1, 2),  # unigrams and bigrams
    min_df=5,            # must appear in at least 5 documents
    max_df=0.8           # must appear in at most 80% of documents
)
tfidf_matrix = vectorizer.fit_transform(df['text'])
feature_names = vectorizer.get_feature_names_out()
3. Bag of Words¶
Python
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(
    max_features=5000,
    stop_words='english',
    binary=True  # only record presence/absence
)
bow_matrix = vectorizer.fit_transform(df['text'])
4. N-grams¶
Python
# Bigrams (pairs of adjacent words)
vectorizer = CountVectorizer(ngram_range=(2, 2))
bigram_matrix = vectorizer.fit_transform(df['text'])
🗺️ Feature Generation and Combination¶
1. Arithmetic Operations¶
Python
# Ratio features
df['price_per_sqft'] = df['price'] / df['sqft']
df['rooms_per_person'] = df['rooms'] / df['household_size']
# Differences
df['price_diff'] = df['current_price'] - df['original_price']
# Cumulative sum
df['cumulative_sum'] = df['value'].cumsum()
2. Polynomial Features¶
Python
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(
    degree=2,
    interaction_only=True,  # generate interaction terms only
    include_bias=False
)
X_poly = poly.fit_transform(X)
# Produces: x1, x2, x1*x2 (interaction_only=True drops pure powers such as x1², x2²)
3. Binning (Discretization)¶
Python
# Equal-width binning
df['age_bin'] = pd.cut(df['age'], bins=5, labels=False)
# Equal-frequency binning
df['income_bin'] = pd.qcut(df['income'], q=4, labels=False)
# Custom bins
df['age_group'] = pd.cut(
    df['age'],
    bins=[0, 18, 35, 50, 100],
    labels=['child', 'young', 'middle-aged', 'senior']
)
4. Interaction Features¶
Python
# Categorical interaction
df['city_gender'] = df['city'] + '_' + df['gender']
# Numerical interaction
df['age_income'] = df['age'] * df['income']
📉 Dimensionality Reduction¶
PCA (Principal Component Analysis)¶
Python
from sklearn.decomposition import PCA
# Keep enough components to explain 95% of the variance
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)
print(f"Original dimensionality: {X.shape[1]}")
print(f"After reduction: {X_pca.shape[1]}")
print(f"Cumulative explained variance ratio: {pca.explained_variance_ratio_.cumsum()}")
Feature Selection¶
Python
# Remove low-variance features
from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold(threshold=0.01)
X_selected = selector.fit_transform(X)
# Univariate selection
from sklearn.feature_selection import SelectKBest, f_classif
selector = SelectKBest(f_classif, k=10)
X_selected = selector.fit_transform(X, y)
# Model-based feature selection
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100),
    threshold='median'
)
X_selected = selector.fit_transform(X, y)
🔄 Complete Feature Engineering Pipeline¶
Python
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
# Declare column types
numeric_features = ['age', 'income', 'score']
categorical_features = ['city', 'gender', 'education']
# Numeric feature handling
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
# Categorical feature handling
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
# Combine
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)
# Full pipeline
clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier())
])
# Train
clf.fit(X_train, y_train)
# Predict
y_pred = clf.predict(X_test)
💡 Feature Engineering Best Practices¶
Checklist¶
Text Only
□ Missing values handled correctly?
□ Outliers inspected and handled?
□ Categorical features encoded appropriately?
□ Temporal features fully extracted?
□ Feature scaling applied where needed?
□ Domain-specific interaction features created?
□ Feature selection performed?
□ Feature usefulness validated?
Quick Feature Importance Exploration¶
Python
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X_train, y_train)
importances = pd.DataFrame({
    'feature': X.columns,
    'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)
print(importances.head(20))
# Visualize
import matplotlib.pyplot as plt
plt.barh(importances['feature'][:20],
         importances['importance'][:20])
plt.gca().invert_yaxis()
plt.show()
Domain-Knowledge-Driven Features¶
Text Only
Key questions:
1. Does this feature make business sense?
2. How are the features related to one another?
3. Are any important derived features missing?
4. Is there a time trend or seasonality?
Example: predicting house prices
- ❌ Raw features only: area, number of rooms
- ✅ Add derived features:
  - price per square meter = price / area
  - house age = current year - year built
  - distance to city center
  - density of nearby amenities
  - school district rating
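The first two derived features above can be sketched as follows; the column names, the listings, and the reference year are made up for illustration:

```python
import pandas as pd

# Hypothetical listing data
df = pd.DataFrame({
    'price':      [3_000_000, 4_500_000],
    'area':       [100.0, 150.0],
    'year_built': [2005, 2015],
})

current_year = 2024  # assumed reference year
df['price_per_sqm'] = df['price'] / df['area']     # price per square meter
df['house_age'] = current_year - df['year_built']  # house age in years

print(df[['price_per_sqm', 'house_age']])
```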
Next step: continue with 09-深度学习进阶.md to explore more neural-network architectures!