
08 - Feature Engineering

Feature engineering diagram

🎯 Why Is Feature Engineering So Important?

Text Only
Garbage In, Garbage Out

Good features + a simple model > poor features + a complex model

The payoff of feature engineering:
- Roughly 80% of modeling time goes into feature engineering
- Good features can take a simple model to SOTA performance
- It is a key differentiator between data scientists

📊 Feature Types

Numerical Features

Text Only
Continuous: age, height, temperature, income
Discrete: household size, purchase count, rating level

Categorical Features

Text Only
Nominal (unordered): color, city, product ID
Ordinal (ordered): rating (low/medium/high), education level

Temporal Features

Text Only
Timestamp: 2024-01-15 10:30:00
Derived: day of week, month, quarter, whether it is a workday

Text Features

Text Only
Raw: "This product is amazing!"
Processed: bag of words, TF-IDF, word embeddings

Geospatial Features

Text Only
Coordinates: (latitude, longitude)
Derived: distance, region, heatmap
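
The first practical step is usually to split a DataFrame's columns by these types so each group can be processed separately. A minimal sketch with a toy DataFrame (all column names and values are illustrative):

```python
import pandas as pd

# Toy DataFrame covering the feature types above
df = pd.DataFrame({
    'age': [25, 41, 33],                         # numerical, continuous
    'purchases': [2, 7, 1],                      # numerical, discrete
    'city': ['Beijing', 'Shanghai', 'Beijing'],  # categorical, nominal
    'signup': pd.to_datetime(['2024-01-15', '2024-02-03', '2024-03-20']),  # temporal
    'review': ['great', 'ok', 'bad'],            # text
})

# Inspecting dtypes is usually the first step of feature engineering
print(df.dtypes)

# select_dtypes splits columns by type for separate processing
numeric_cols = df.select_dtypes(include='number').columns.tolist()
print(numeric_cols)  # ['age', 'purchases']
```

`select_dtypes` also accepts `include='datetime'` or `include='object'` to pull out the temporal and text/categorical columns.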

🔧 Numerical Feature Processing

1. Handling Missing Values

Identify missing values:

Python
import pandas as pd
import numpy as np

# Inspect missingness
df.isnull().sum()
df.isnull().mean()  # proportion missing

# Visualize
import missingno as msno
msno.matrix(df)
msno.bar(df)

Handling strategies:

Drop

Python
# Drop rows with too many missing values (assignment instead of inplace=True)
df = df.dropna(thresh=len(df.columns) * 0.7)

# Drop columns with too many missing values (assignment instead of inplace=True)
df = df.dropna(thresh=len(df) * 0.7, axis=1)

Fill

Python
# Mean imputation (normally distributed data)
df['column'] = df['column'].fillna(df['column'].mean())

# Median imputation (data with outliers)
df['column'] = df['column'].fillna(df['column'].median())

# Mode imputation (categorical features)
df['column'] = df['column'].fillna(df['column'].mode()[0])

# Forward fill (time series)
df['column'] = df['column'].ffill()

# Interpolation
df['column'] = df['column'].interpolate(method='linear')

# Missingness indicator (preserves the information)
df['column_missing'] = df['column'].isnull().astype(int)
df['column'] = df['column'].fillna(-999)

Advanced methods

Python
# KNN imputation
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=5)
df_filled = imputer.fit_transform(df)

# Iterative imputation (MICE)
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
imputer = IterativeImputer()
df_filled = imputer.fit_transform(df)

2. Outlier Detection and Handling

Detection methods:

Box plot (IQR method)

Python
def detect_outliers_iqr(series):
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return (series < lower_bound) | (series > upper_bound)

outliers = detect_outliers_iqr(df['column'])
print(f"Number of outliers: {outliers.sum()}")

Z-score method

Python
from scipy import stats

z_scores = np.abs(stats.zscore(df['column']))
outliers = z_scores > 3  # beyond 3 standard deviations

Visualization

Python
import matplotlib.pyplot as plt

# Box plot
plt.boxplot(df['column'])
plt.show()

# Histogram
plt.hist(df['column'], bins=50)
plt.show()

# Scatter plot (multivariate)
plt.scatter(df['feature1'], df['feature2'])
plt.show()

Handling strategies:

Python
# Drop
df = df[~outliers]

# Capping (winsorization)
upper_limit = df['column'].quantile(0.99)
lower_limit = df['column'].quantile(0.01)
df['column'] = df['column'].clip(lower_limit, upper_limit)

# Log transform (skewed data)
df['column'] = np.log1p(df['column'])  # log(1+x)

# Box-Cox transform (requires strictly positive input, hence the +1 shift)
from scipy import stats
transformed, lambda_param = stats.boxcox(df['column'] + 1)

3. Feature Scaling

Why scale?
- Gradient descent converges faster
- Distance-based algorithms (KNN, K-Means, SVM) require comparable scales
- Neural networks train more stably

Standardization (Z-Score Normalization)

Python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Result: mean = 0, standard deviation = 1
# Best for: approximately normal data

Min-Max Scaling

Python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))
X_scaled = scaler.fit_transform(X)

# Result: values in [0, 1]
# Best for: neural networks, image pixels

Robust Scaling (RobustScaler)

Python
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
X_scaled = scaler.fit_transform(X)

# Uses the median and interquartile range, so it is robust to outliers
# Best for: data containing outliers

Max-Abs Scaling (MaxAbsScaler)

Python
from sklearn.preprocessing import MaxAbsScaler

scaler = MaxAbsScaler()
X_scaled = scaler.fit_transform(X)

# Scales to [-1, 1] and preserves sparsity
# Best for: sparse data

Important: fit the scaler on the training set only, then reuse it on the test set!

Python
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Note: transform only, never fit!


4. Feature Transformation

Log transform

Python
# Right-skewed data → closer to a normal distribution
df['log_feature'] = np.log1p(df['feature'])

# Best for: income, prices, count data

Exponential transform

Python
# Left-skewed data
df['exp_feature'] = np.exp(df['feature'] / df['feature'].max())

Box-Cox transform (λ chosen automatically)

Python
from scipy import stats

# Only valid for strictly positive values
transformed, lambda_param = stats.boxcox(df['feature'])
print(f"Optimal λ: {lambda_param:.2f}")

# λ = 1: no change
# λ = 0: log transform
# λ = -1: reciprocal

Yeo-Johnson transform (supports negative values)

Python
from sklearn.preprocessing import PowerTransformer

pt = PowerTransformer(method='yeo-johnson')
X_transformed = pt.fit_transform(X)

🏷️ Categorical Feature Processing

1. Label Encoding

Python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['city_encoded'] = le.fit_transform(df['city'])

# Result: 0, 1, 2, 3, ...
# Best for: ordinal categories (low/medium/high)
# Caveat: imposes an ordinal relationship (meaningless for cities)

2. One-Hot Encoding

Python
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse_output=False, drop='first')  # avoids multicollinearity
encoded = encoder.fit_transform(df[['city']])

# Or with pandas
df = pd.get_dummies(df, columns=['city'], drop_first=True)

# Result: city_Beijing, city_Shanghai, city_Guangzhou, ...
# Best for: nominal categories with low cardinality (<10)

3. Target Encoding

Python
# Replace each category with the mean of the target variable
mean_target = df.groupby('category')['target'].mean()
df['category_encoded'] = df['category'].map(mean_target)

# Best for: high-cardinality categories (e.g. 1000 cities)
# Caveat: can overfit (use smoothing or cross-validation)
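
The overfitting caveat can be softened with smoothing: blend each category's target mean with the global mean, weighted by how often the category appears, so rare categories are pulled toward the global mean. A minimal sketch (the toy data and the smoothing strength `m` are illustrative):

```python
import pandas as pd

# Toy data: category 'c' appears only once, so its raw mean is unreliable
df = pd.DataFrame({
    'category': ['a', 'a', 'a', 'b', 'b', 'c'],
    'target':   [1,   0,   1,   0,   0,   1],
})

global_mean = df['target'].mean()  # 0.5
m = 2  # smoothing strength: higher m pulls rare categories toward the global mean

# Per-category count and mean, then the smoothed blend
cat_stats = df.groupby('category')['target'].agg(['mean', 'count'])
smoothed = (cat_stats['count'] * cat_stats['mean'] + m * global_mean) / (cat_stats['count'] + m)
df['category_te'] = df['category'].map(smoothed)

# 'c' is encoded as (1*1 + 2*0.5) / (1+2) = 2/3 instead of its raw mean 1.0
print(df)
```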

4. Frequency Encoding

Python
# Replace each category with its relative frequency
frequency = df['category'].value_counts(normalize=True)
df['category_freq'] = df['category'].map(frequency)

5. Embedding Encoding

Python
# Neural-network approach: map categories into a low-dimensional continuous space
# See Word2Vec in the NLP chapter for details
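
The lookup mechanics can be sketched without a neural network: each category indexes a row of an (n_categories, dim) matrix. In a real model this table is a learned layer (e.g. a PyTorch `nn.Embedding` weight matrix); here a fixed random matrix stands in for it, and all names are illustrative:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
categories = ['Beijing', 'Shanghai', 'Guangzhou']

# In practice this table is learned during training;
# a random (n_categories, dim) matrix just illustrates the lookup.
dim = 4
embedding_table = rng.normal(size=(len(categories), dim))
index = {cat: i for i, cat in enumerate(categories)}

# Encode a column of categories as dense vectors via row lookup
cities = pd.Series(['Shanghai', 'Beijing', 'Shanghai'])
vectors = embedding_table[cities.map(index).to_numpy()]
print(vectors.shape)  # (3, 4)
```

Note that repeated categories map to identical vectors; the model's training signal is what makes similar categories end up with similar rows.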

🕐 Temporal Feature Engineering

Python
import pandas as pd
import numpy as np

# Convert to datetime
df['date'] = pd.to_datetime(df['date'])

# Basic features
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['dayofweek'] = df['date'].dt.dayofweek
df['hour'] = df['date'].dt.hour
df['minute'] = df['date'].dt.minute

# Advanced features
df['quarter'] = df['date'].dt.quarter
df['weekofyear'] = df['date'].dt.isocalendar().week
df['dayofyear'] = df['date'].dt.dayofyear

# Indicator features
df['is_weekend'] = df['dayofweek'].isin([5, 6]).astype(int)
df['is_month_start'] = df['date'].dt.is_month_start.astype(int)
df['is_month_end'] = df['date'].dt.is_month_end.astype(int)
df['is_quarter_start'] = df['date'].dt.is_quarter_start.astype(int)

# Cyclical encoding (so December is not "larger" than January)
def encode_cyclical(feature, max_val):
    return (
        np.sin(2 * np.pi * feature / max_val),
        np.cos(2 * np.pi * feature / max_val)
    )

df['month_sin'], df['month_cos'] = encode_cyclical(df['month'], 12)
df['dayofweek_sin'], df['dayofweek_cos'] = encode_cyclical(df['dayofweek'], 7)

# Time-delta features
df['days_since_start'] = (df['date'] - df['date'].min()).dt.days
df['days_to_end'] = (df['date'].max() - df['date']).dt.days

📝 Text Feature Engineering

1. Basic Statistical Features

Python
df['text_length'] = df['text'].str.len()
df['word_count'] = df['text'].str.split().str.len()
# Approximate: text_length also counts spaces and punctuation
df['avg_word_length'] = df['text'].str.len() / df['word_count']
df['uppercase_count'] = df['text'].str.count(r'[A-Z]')
df['exclamation_count'] = df['text'].str.count(r'!')
df['question_count'] = df['text'].str.count(r'\?')

2. TF-IDF

Python
from sklearn.feature_extraction.text import TfidfVectorizer

# Word level
vectorizer = TfidfVectorizer(
    max_features=5000,
    stop_words='english',
    ngram_range=(1, 2),  # unigrams and bigrams
    min_df=5,            # appears in at least 5 documents
    max_df=0.8           # appears in at most 80% of documents
)

tfidf_matrix = vectorizer.fit_transform(df['text'])
feature_names = vectorizer.get_feature_names_out()

3. Bag of Words

Python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(
    max_features=5000,
    stop_words='english',
    binary=True  # presence/absence only
)

bow_matrix = vectorizer.fit_transform(df['text'])

4. N-grams

Python
# Bigrams (pairs of adjacent words)
vectorizer = CountVectorizer(ngram_range=(2, 2))
bigram_matrix = vectorizer.fit_transform(df['text'])

🗺️ Feature Generation and Combination

1. Arithmetic Combinations

Python
# Ratio features
df['price_per_sqft'] = df['price'] / df['sqft']
df['rooms_per_person'] = df['rooms'] / df['household_size']

# Differences
df['price_diff'] = df['current_price'] - df['original_price']

# Cumulative sum
df['cumulative_sum'] = df['value'].cumsum()

2. Polynomial Features

Python
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(
    degree=2,
    interaction_only=True,  # interaction terms only
    include_bias=False
)

X_poly = poly.fit_transform(X)
# Generates: x1, x2, x1*x2 (interaction_only=True excludes pure powers like x1², x2²)

3. Binning (Discretization)

Python
# Equal-width binning
df['age_bin'] = pd.cut(df['age'], bins=5, labels=False)

# Equal-frequency binning
df['income_bin'] = pd.qcut(df['income'], q=4, labels=False)

# Custom binning
df['age_group'] = pd.cut(
    df['age'],
    bins=[0, 18, 35, 50, 100],
    labels=['child', 'young adult', 'middle-aged', 'senior']
)

4. Interaction Features

Python
# Categorical interactions
df['city_gender'] = df['city'] + '_' + df['gender']

# Numerical interactions
df['age_income'] = df['age'] * df['income']

📉 Dimensionality Reduction

PCA (Principal Component Analysis)

Python
from sklearn.decomposition import PCA

# Keep enough components to explain 95% of the variance
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)

print(f"Original dimensionality: {X.shape[1]}")
print(f"After reduction: {X_pca.shape[1]}")
print(f"Cumulative explained variance ratio: {pca.explained_variance_ratio_.cumsum()}")

Feature Selection

Python
# Remove low-variance features
from sklearn.feature_selection import VarianceThreshold

selector = VarianceThreshold(threshold=0.01)
X_selected = selector.fit_transform(X)

# Univariate selection
from sklearn.feature_selection import SelectKBest, f_classif

selector = SelectKBest(f_classif, k=10)
X_selected = selector.fit_transform(X, y)

# Model-based feature selection
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier

selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100),
    threshold='median'
)
X_selected = selector.fit_transform(X, y)

🔄 Complete Feature Engineering Workflow

Python
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier

# Declare column types
numeric_features = ['age', 'income', 'score']
categorical_features = ['city', 'gender', 'education']

# Numeric feature processing
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Categorical feature processing
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

# Full pipeline
clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier())
])

# Train
clf.fit(X_train, y_train)

# Predict
y_pred = clf.predict(X_test)

💡 Feature Engineering Best Practices

Checklist

Text Only
□ Are missing values handled correctly?
□ Have outliers been checked and handled?
□ Are categorical features encoded appropriately?
□ Have temporal features been fully extracted?
□ Is feature scaling applied where needed?
□ Have domain-relevant interaction features been created?
□ Has feature selection been performed?
□ Has the usefulness of the features been validated?

Quick Feature Importance Check

Python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100)
rf.fit(X_train, y_train)

importances = pd.DataFrame({
    'feature': X.columns,
    'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)

print(importances.head(20))

# Visualize
import matplotlib.pyplot as plt
plt.barh(importances['feature'][:20],
         importances['importance'][:20])
plt.gca().invert_yaxis()
plt.show()

Domain-Knowledge-Driven Features

Text Only
Key questions:
1. Does this feature make business sense?
2. What are the relationships between features?
3. Are any important derived features missing?
4. Are there temporal trends or seasonality?

Example: predicting house prices
- ❌ Raw features only: area, number of rooms
- ✅ Add derived features:
   - price per square meter = price / area
   - house age = current year - year built
   - distance to the city center
   - density of nearby amenities
   - school district rating
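
The first two derived features above take only a few lines; a sketch with made-up numbers (the current year is hard-coded for the example):

```python
import pandas as pd

# Illustrative house-price data
df = pd.DataFrame({
    'price':      [3_000_000, 5_200_000],
    'area_sqm':   [80, 120],
    'year_built': [2005, 1998],
})

current_year = 2024  # assumption for this example

# Ratio feature: price per square meter
df['price_per_sqm'] = df['price'] / df['area_sqm']

# Difference feature: house age
df['house_age'] = current_year - df['year_built']

print(df[['price_per_sqm', 'house_age']])
```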

Next step: continue with 09-深度学习进阶.md to learn more neural network architectures!