📖 第4章:文本分类
- 学习时间:8小时
- 难度星级:⭐⭐⭐
- 前置知识:文本表示方法、机器学习基础、PyTorch基础
- 学习目标:掌握从朴素贝叶斯到BERT的文本分类方法,能独立完成情感分析实战
📋 目录
- 1. 文本分类概述
- 2. 朴素贝叶斯文本分类
- 3. SVM文本分类
- 4. TextCNN
- 5. RNN/LSTM文本分类
- 6. BERT文本分类
- 7. 多标签分类
- 8. 情感分析实战
- 9. 面试要点
- 10. 练习题
1. 文本分类概述
1.1 什么是文本分类
文本分类是NLP中最基础、最常见的任务:给定一段文本,将其归入预定义的类别中。
text_classification_tasks = {
"情感分析": {
"输入": "这家餐厅的菜品非常好吃,服务也很周到",
"输出": "正面",
"场景": "电商评论、社交媒体分析",
},
"新闻分类": {
"输入": "中国队在世界杯预选赛中以3:1击败对手",
"输出": "体育",
"场景": "新闻推荐、内容分发",
},
"意图识别": {
"输入": "帮我查一下明天北京的天气",
"输出": "天气查询",
"场景": "智能助手、客服机器人",
},
"垃圾检测": {
"输入": "恭喜您中了100万大奖,点击领取",
"输出": "垃圾信息",
"场景": "邮件过滤、内容审核",
},
"情绪识别": {
"输入": "今天被老板骂了,心情很糟糕",
"输出": "愤怒/悲伤",
"场景": "心理健康监测",
},
}
1.2 文本分类的技术演进
传统方法 深度学习方法 预训练模型
(2000s) (2014-2017) (2018-今)
朴素贝叶斯 ──→ TextCNN ──→ BERT微调
SVM ──→ LSTM/BiLSTM ──→ RoBERTa
逻辑回归 ──→ Attention ──→ GPT Prompt
随机森林 RCNN 大模型 Zero-shot
特征: TF-IDF 特征: Word2Vec/GloVe 特征: 预训练表示
1.3 文本分类Pipeline
文本输入
│
▼
┌──────────────┐
│ 数据预处理 │ 清洗、分词、编码
└──────────────┘
│
▼
┌──────────────┐
│ 特征提取 │ TF-IDF / 词向量 / BERT编码
└──────────────┘
│
▼
┌──────────────┐
│ 模型训练 │ NB / SVM / CNN / LSTM / BERT
└──────────────┘
│
▼
┌──────────────┐
│ 模型评估 │ Accuracy / F1 / AUC
└──────────────┘
│
▼
┌──────────────┐
│ 模型部署 │ API / 在线服务
└──────────────┘
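上述Pipeline可以用几行代码串起来。下面是一个最小示意(数据、分词结果与模型选择均为演示用的假设,并非固定做法),把“特征提取 → 模型训练 → 模型评估”走通一遍:
# 最小Pipeline示意:假设文本已完成清洗和分词,词之间用空格连接
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
texts = ["这部 电影 很 好看", "剧情 太差 不 推荐", "强烈 推荐 值得 一看", "浪费 时间 很 无聊"]
labels = ["正面", "负面", "正面", "负面"]
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),                 # 特征提取:TF-IDF
    ("clf", LogisticRegression(max_iter=1000)),   # 模型训练:逻辑回归
])
pipe.fit(texts, labels)
print("训练集准确率:", accuracy_score(labels, pipe.predict(texts)))  # 模型评估(演示用,实际应在独立测试集上评估)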
2. 朴素贝叶斯文本分类
2.1 原理
基于贝叶斯定理和特征独立性假设:
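补充公式说明(与下方实现一一对应):对分词后的文档 d = (w1, w2, ..., wn) 和类别 c,有
    P(c|d) ∝ P(c) × ∏ P(wi|c)
实现中取对数避免连乘下溢:log P(c|d) ∝ log P(c) + Σ log P(wi|c)。似然 P(w|c) 用拉普拉斯平滑估计:
    P(w|c) = (count(w, c) + α) / (total_words_c + α × |V|)
其中 |V| 为词表大小,total_words_c 为类别 c 中的总词数,α 即下方代码中的平滑参数 alpha。下面的实现与上述公式逐项对应: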
from collections import Counter
import math
class MultinomialNaiveBayes:
"""多项式朴素贝叶斯文本分类器"""
def __init__(self, alpha=1.0):
self.alpha = alpha # 拉普拉斯平滑参数
self.class_log_prior = {}
self.feature_log_prob = {}
self.classes = []
self.vocab = set()
def fit(self, X, y):
"""
X: 分词后的文本列表 [["词1", "词2"], ...]
y: 标签列表
"""
self.classes = list(set(y))
n_docs = len(X)
# 构建词汇表
for doc in X:
self.vocab.update(doc)
vocab_size = len(self.vocab)
self.word2idx = {w: i for i, w in enumerate(sorted(self.vocab))} # enumerate同时获取索引和元素
for c in self.classes:
# 先验概率 P(c)
c_docs = [X[i] for i in range(n_docs) if y[i] == c]
self.class_log_prior[c] = math.log(len(c_docs) / n_docs)
# 似然概率 P(w|c)
word_counts = Counter() # Counter统计元素出现次数
total_words = 0
for doc in c_docs:
word_counts.update(doc)
total_words += len(doc)
# 拉普拉斯平滑
self.feature_log_prob[c] = {}
for word in self.vocab:
count = word_counts.get(word, 0)
self.feature_log_prob[c][word] = math.log(
(count + self.alpha) / (total_words + self.alpha * vocab_size)
)
def predict(self, X):
return [self._predict_single(doc) for doc in X]
def _predict_single(self, doc):
scores = {}
for c in self.classes:
score = self.class_log_prior[c]
for word in doc:
if word in self.feature_log_prob[c]:
score += self.feature_log_prob[c][word]
scores[c] = score
return max(scores, key=scores.get)
def predict_proba(self, doc):
"""返回各类别概率"""
scores = {}
for c in self.classes:
score = self.class_log_prior[c]
for word in doc:
if word in self.feature_log_prob[c]:
score += self.feature_log_prob[c][word]
scores[c] = score
# Softmax归一化
max_score = max(scores.values())
exp_scores = {c: math.exp(s - max_score) for c, s in scores.items()}
total = sum(exp_scores.values())
return {c: s / total for c, s in exp_scores.items()}
# 训练和测试
import jieba
train_data = [
("这部电影太好看了强烈推荐", "正面"),
("非常喜欢剧情很精彩", "正面"),
("拍得真好演技在线", "正面"),
("画面精美值得一看", "正面"),
("好看推荐大家去看", "正面"),
("很感动的一部电影", "正面"),
("太难看了浪费时间", "负面"),
("剧情太差不推荐", "负面"),
("很无聊看了一半就关了", "负面"),
("垃圾电影浪费钱", "负面"),
("太差了别看", "负面"),
("无聊透顶完全看不下去", "负面"),
]
X_train = [jieba.lcut(text) for text, _ in train_data]
y_train = [label for _, label in train_data]
nb = MultinomialNaiveBayes(alpha=1.0)
nb.fit(X_train, y_train)
# 测试
test_texts = [
"这个电影很好看",
"太无聊了不推荐",
"剧情精彩强烈推荐",
"浪费时间垃圾电影",
]
for text in test_texts:
words = jieba.lcut(text)
    pred = nb.predict([words])[0]
proba = nb.predict_proba(words)
print(f"'{text}' → {pred} (正面:{proba.get('正面',0):.3f}, 负面:{proba.get('负面',0):.3f})")
3. SVM文本分类
3.1 SVM用于文本分类
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score
import jieba
# 准备更多数据
train_texts = [
"这部电影太好看了", "非常喜欢剧情精彩", "演技出色推荐",
"画面精美值得观看", "导演功力深厚", "好看感人至深",
"太难看了浪费时间", "剧情很差不推荐", "很无聊看不下去",
"垃圾电影别看", "太差了失望透顶", "无聊至极",
"还行一般般吧", "中规中矩没特色", "普普通通",
]
train_labels = ["正面"]*6 + ["负面"]*6 + ["中性"]*3
# 中文分词后用空格连接
train_texts_seg = [' '.join(jieba.lcut(t)) for t in train_texts]
# TF-IDF特征
vectorizer = TfidfVectorizer(max_features=5000)
X_train = vectorizer.fit_transform(train_texts_seg)
# 训练SVM
svm = LinearSVC(C=1.0, max_iter=1000)
svm.fit(X_train, train_labels)
# 测试
test_texts = ["好看推荐", "太差了", "还可以吧"]
test_texts_seg = [' '.join(jieba.lcut(t)) for t in test_texts]
X_test = vectorizer.transform(test_texts_seg)
predictions = svm.predict(X_test)
for text, pred in zip(test_texts, predictions): # zip按位置配对
print(f"'{text}' → {pred}")
4. TextCNN
4.1 TextCNN架构
Kim (2014) 提出的TextCNN是CNN在文本分类中的经典应用。
输入文本: "自然 语言 处理 技术 很 有趣"
│
▼
┌────────────────────────────────┐
│ Embedding Layer │ 每个词 → d维向量
│ [v_自然 v_语言 v_处理 v_技术 v_很 v_有趣] │
└────────────────────────────────┘
│
▼
┌────────────────────────────────┐
│ 多尺度卷积层 │
│ ┌──────┐ ┌──────┐ ┌──────┐ │
│ │size=2│ │size=3│ │size=4│ │ 不同大小的卷积核
│ │n=100 │ │n=100 │ │n=100 │ │ 捕捉不同长度的n-gram
│ └──────┘ └──────┘ └──────┘ │
└────────────────────────────────┘
│
▼
┌────────────────────────────────┐
│ Max-over-time Pooling │ 每个特征图取最大值
│ → 100 + 100 + 100 = 300维 │
└────────────────────────────────┘
│
▼
┌────────────────────────────────┐
│ 全连接层 + Softmax │ 输出类别概率
│ 300 → num_classes │
└────────────────────────────────┘
4.2 PyTorch实现TextCNN
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
class TextCNN(nn.Module): # 继承nn.Module定义网络层
"""TextCNN文本分类模型"""
def __init__(self, vocab_size, embedding_dim, num_classes,
num_filters=100, filter_sizes=(2, 3, 4), dropout=0.5,
pretrained_embeddings=None):
super(TextCNN, self).__init__()
# 嵌入层
self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
if pretrained_embeddings is not None:
self.embedding.weight.data.copy_(torch.FloatTensor(pretrained_embeddings))
self.embedding.weight.requires_grad = True # 是否微调词向量
# 多尺度卷积层
self.convs = nn.ModuleList([
nn.Conv1d(embedding_dim, num_filters, kernel_size=fs)
for fs in filter_sizes
])
# Dropout
self.dropout = nn.Dropout(dropout)
# 全连接层
self.fc = nn.Linear(len(filter_sizes) * num_filters, num_classes)
def forward(self, x):
"""
x: (batch_size, seq_len)
"""
# Embedding: (batch_size, seq_len, embedding_dim)
embedded = self.embedding(x)
# 转换维度以适配Conv1d: (batch_size, embedding_dim, seq_len)
embedded = embedded.permute(0, 2, 1)
# 多尺度卷积 + 池化
conv_outputs = []
for conv in self.convs:
# 卷积: (batch_size, num_filters, seq_len - kernel_size + 1)
c = F.relu(conv(embedded)) # F.xxx PyTorch函数式API
# 最大池化: (batch_size, num_filters)
c = F.max_pool1d(c, c.size(2)).squeeze(2) # squeeze压缩维度
conv_outputs.append(c)
# 拼接: (batch_size, num_filters * len(filter_sizes))
out = torch.cat(conv_outputs, dim=1) # torch.cat沿已有维度拼接张量
# Dropout + 全连接
out = self.dropout(out)
out = self.fc(out)
return out
# ==================
# 数据准备和训练
# ==================
class TextDataset:
"""文本数据集"""
def __init__(self, texts, labels, word2idx, max_len=50):
self.texts = texts
self.labels = labels
self.word2idx = word2idx
self.max_len = max_len
def __len__(self): # __len__定义len()行为
return len(self.texts)
def encode(self, words):
"""将词列表转为索引"""
indices = [self.word2idx.get(w, 1) for w in words] # 1 = UNK
# 填充或截断
if len(indices) < self.max_len:
indices += [0] * (self.max_len - len(indices)) # 0 = PAD
else:
indices = indices[:self.max_len]
return indices
def get_batch(self, batch_size=32):
"""获取批次数据"""
indices = np.random.choice(len(self.texts), batch_size, replace=True)
batch_x = [self.encode(self.texts[i]) for i in indices]
batch_y = [self.labels[i] for i in indices]
return torch.LongTensor(batch_x), torch.LongTensor(batch_y)
# 准备数据
import jieba
train_data = [
("这部电影太好看了", 1), ("非常喜欢推荐", 1), ("精彩绝伦好评", 1),
("画面很美剧情打动人", 1), ("演技出色故事感人", 1), ("好看强烈推荐", 1),
("太难看了不推荐", 0), ("剧情太差无聊", 0), ("垃圾电影别看", 0),
("浪费时间很差", 0), ("失望透顶太烂了", 0), ("看不下去", 0),
] * 10 # 重复增加数据量
texts_tokenized = [jieba.lcut(text) for text, _ in train_data]
labels = [label for _, label in train_data]
# 建立词汇表
vocab = set()
for tokens in texts_tokenized:
vocab.update(tokens)
word2idx = {"<PAD>": 0, "<UNK>": 1}
for i, word in enumerate(sorted(vocab), start=2):
word2idx[word] = i
vocab_size = len(word2idx)
label2idx = {0: 0, 1: 1}
# 创建数据集
dataset = TextDataset(texts_tokenized, labels, word2idx, max_len=20)
# 创建模型
model = TextCNN(
vocab_size=vocab_size,
embedding_dim=64,
num_classes=2,
num_filters=32,
filter_sizes=(2, 3, 4),
dropout=0.3,
)
# 训练(简化写法:每个迭代步随机采样一个batch,这里将迭代步记为epoch仅作演示)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()
model.train() # train()训练模式
for epoch in range(20):
batch_x, batch_y = dataset.get_batch(32)
output = model(batch_x)
loss = criterion(output, batch_y)
optimizer.zero_grad() # 清零梯度
loss.backward() # 反向传播计算梯度
optimizer.step() # 更新参数
if (epoch + 1) % 5 == 0:
pred = output.argmax(dim=1)
acc = (pred == batch_y).float().mean()
print(f"Epoch {epoch+1}, Loss: {loss.item():.4f}, Acc: {acc.item():.4f}") # 将单元素张量转为Python数值
# 预测
model.eval()
test_texts = ["好看推荐大家看", "太差了别浪费时间"]
for text in test_texts:
words = jieba.lcut(text)
indices = dataset.encode(words)
x = torch.LongTensor([indices])
with torch.no_grad(): # 禁用梯度计算,节省内存
output = model(x)
pred = output.argmax(dim=1).item()
prob = F.softmax(output, dim=1)[0]
sentiment = "正面" if pred == 1 else "负面"
print(f"'{text}' → {sentiment} (正面概率: {prob[1]:.3f})")
5. RNN/LSTM文本分类
5.1 LSTM文本分类
class LSTMClassifier(nn.Module):
"""LSTM文本分类模型"""
def __init__(self, vocab_size, embedding_dim, hidden_dim,
num_classes, num_layers=2, bidirectional=True, dropout=0.5):
super(LSTMClassifier, self).__init__()
self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
self.lstm = nn.LSTM(
embedding_dim, hidden_dim,
num_layers=num_layers,
bidirectional=bidirectional,
batch_first=True,
dropout=dropout if num_layers > 1 else 0,
)
self.dropout = nn.Dropout(dropout)
# 双向LSTM输出维度翻倍
fc_dim = hidden_dim * 2 if bidirectional else hidden_dim
self.fc = nn.Linear(fc_dim, num_classes)
def forward(self, x):
# x: (batch_size, seq_len)
embedded = self.embedding(x) # (batch_size, seq_len, embedding_dim)
# LSTM
output, (hidden, cell) = self.lstm(embedded)
# output: (batch_size, seq_len, hidden_dim * 2)
        # 方法1:取最后一个时间步的隐状态(简单但有效,实现示意见本小节末尾)
        # 方法2:平均池化
        # 方法3:注意力池化(见5.2节)
        # 这里使用平均池化
        pooled = output.mean(dim=1)  # (batch_size, hidden_dim * 2),注意:简单平均会把PAD位置也计入,严格实现应结合mask
out = self.dropout(pooled)
out = self.fc(out)
return out
# 创建BiLSTM模型
lstm_model = LSTMClassifier(
vocab_size=vocab_size,
embedding_dim=64,
hidden_dim=128,
num_classes=2,
num_layers=2,
bidirectional=True,
dropout=0.3,
)
print(f"模型参数量: {sum(p.numel() for p in lstm_model.parameters()):,}")
print(lstm_model)
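上面 forward 注释中提到的“方法1:取最后一个时间步”,在双向LSTM中通常指拼接最后一层前向与后向的最终隐状态。下面是一个示意函数(函数名 last_hidden_pooling 为演示自拟,可用来替换 forward 中的平均池化):
# 示意:用最终隐状态代替平均池化
def last_hidden_pooling(lstm, embedded):
    _, (hidden, _) = lstm(embedded)
    # hidden: (num_layers * num_directions, batch_size, hidden_dim)
    # 双向LSTM最后一层的前向/后向隐状态分别是hidden[-2]和hidden[-1]
    if lstm.bidirectional:
        return torch.cat([hidden[-2], hidden[-1]], dim=1)  # (batch_size, hidden_dim * 2)
    return hidden[-1]                                       # (batch_size, hidden_dim)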
5.2 LSTM + Attention
class AttentionLayer(nn.Module):
"""简单的注意力机制"""
def __init__(self, hidden_dim):
super(AttentionLayer, self).__init__()
self.attention = nn.Linear(hidden_dim, 1)
def forward(self, lstm_output):
# lstm_output: (batch_size, seq_len, hidden_dim)
attention_weights = torch.softmax(
self.attention(lstm_output).squeeze(-1), dim=1
)
# (batch_size, seq_len)
# 加权求和
weighted = torch.bmm(
attention_weights.unsqueeze(1), lstm_output # unsqueeze增加一个维度
).squeeze(1)
# (batch_size, hidden_dim)
return weighted, attention_weights
class LSTMAttentionClassifier(nn.Module):
"""LSTM + Attention 文本分类"""
def __init__(self, vocab_size, embedding_dim, hidden_dim,
num_classes, bidirectional=True, dropout=0.5):
super().__init__() # super()调用父类方法
self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True,
bidirectional=bidirectional)
fc_dim = hidden_dim * 2 if bidirectional else hidden_dim
self.attention = AttentionLayer(fc_dim)
self.dropout = nn.Dropout(dropout)
self.fc = nn.Linear(fc_dim, num_classes)
def forward(self, x):
embedded = self.embedding(x)
lstm_out, _ = self.lstm(embedded)
attn_out, attn_weights = self.attention(lstm_out)
out = self.dropout(attn_out)
out = self.fc(out)
return out, attn_weights
print("LSTM + Attention模型创建成功")
6. BERT文本分类
6.1 BERT微调原理
BERT微调流程:
输入: [CLS] 这 部 电 影 很 好 看 [SEP]
│
▼
┌──────────────────────┐
│ BERT Encoder │ 12层 Transformer
│ (预训练参数) │
└──────────────────────┘
│
▼
[CLS] 的输出向量 (768维)
│
▼
┌──────────────────────┐
│ 分类头 │ Linear(768, num_classes)
│ (随机初始化) │
└──────────────────────┘
│
▼
类别概率 [0.9, 0.1] → 正面
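上图中的分类头也可以自己实现:用 BertModel 取 [CLS] 位置的输出向量,再接一个线性层。下面是一个最小示意(类名 BertClassifier 为演示自拟,思路与6.2节直接使用 BertForSequenceClassification 一致,后者内部实现略有差异,这里仅用于说明原理):
import torch.nn as nn
from transformers import BertModel
class BertClassifier(nn.Module):
    """手动实现BERT + 线性分类头(示意)"""
    def __init__(self, model_name="bert-base-chinese", num_classes=2, dropout=0.1):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(self.bert.config.hidden_size, num_classes)  # 768 → num_classes
    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls_vec = outputs.last_hidden_state[:, 0]  # [CLS]位置的输出向量
        return self.fc(self.dropout(cls_vec))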
6.2 使用Hugging Face实现BERT分类
# pip install transformers datasets
from transformers import BertTokenizer, BertForSequenceClassification
from transformers import Trainer, TrainingArguments
import torch
# ==================
# 1. 加载预训练模型和分词器
# ==================
model_name = "bert-base-chinese" # 中文BERT
# 加载分词器
tokenizer = BertTokenizer.from_pretrained(model_name)
# 分词示例
text = "这部电影太好看了"
tokens = tokenizer(text, padding='max_length', truncation=True,
max_length=32, return_tensors='pt')
print(f"Input IDs: {tokens['input_ids']}")
print(f"Attention Mask: {tokens['attention_mask']}")
print(f"Tokens: {tokenizer.convert_ids_to_tokens(tokens['input_ids'][0])}")
# ==================
# 2. 构建数据集
# ==================
class SentimentDataset(torch.utils.data.Dataset):
def __init__(self, texts, labels, tokenizer, max_len=128):
self.texts = texts
self.labels = labels
self.tokenizer = tokenizer
self.max_len = max_len
def __len__(self):
return len(self.texts)
def __getitem__(self, idx): # __getitem__定义索引访问行为
text = self.texts[idx]
label = self.labels[idx]
encoding = self.tokenizer(
text,
padding='max_length',
truncation=True,
max_length=self.max_len,
return_tensors='pt',
)
return {
'input_ids': encoding['input_ids'].squeeze(),
'attention_mask': encoding['attention_mask'].squeeze(),
'labels': torch.tensor(label, dtype=torch.long),
}
# 准备数据
train_texts = [
"这部电影太好看了", "非常喜欢推荐", "精彩绝伦",
"画面很美", "演技出色", "好看感人",
"太难看了", "剧情太差", "垃圾电影",
"浪费时间", "很失望", "看不下去",
]
train_labels = [1]*6 + [0]*6
train_dataset = SentimentDataset(train_texts, train_labels, tokenizer)
# ==================
# 3. 加载BERT模型
# ==================
model = BertForSequenceClassification.from_pretrained(
model_name,
num_labels=2,
)
# ==================
# 4. 训练配置和训练
# ==================
# 简单的训练循环(不使用Trainer;这里为演示直接对文本列表切片组batch,上面构建的train_dataset可配合DataLoader或Trainer使用)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for epoch in range(3):
total_loss = 0
for i in range(0, len(train_texts), 4):
batch_texts = train_texts[i:i+4]
batch_labels = train_labels[i:i+4]
encoding = tokenizer(
batch_texts,
padding=True,
truncation=True,
max_length=64,
return_tensors='pt',
)
encoding['labels'] = torch.tensor(batch_labels)
outputs = model(**encoding)
loss = outputs.loss
total_loss += loss.item()
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"Epoch {epoch+1}, Loss: {total_loss:.4f}")
# ==================
# 5. 预测
# ==================
model.eval()
test_texts = ["这个电影很好看强烈推荐", "太差了别浪费时间"]
for text in test_texts:
encoding = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=64)
with torch.no_grad():
outputs = model(**encoding)
logits = outputs.logits
pred = logits.argmax(dim=1).item()
probs = torch.softmax(logits, dim=1)[0]
sentiment = "正面" if pred == 1 else "负面"
print(f"'{text}' → {sentiment} (正面概率: {probs[1]:.3f})")
6.3 BERT微调最佳实践
# BERT微调的关键超参数
best_practices = {
"学习率": "2e-5 到 5e-5(比普通深度学习小很多)",
"Batch Size": "16 或 32(显存允许的情况下尽量大)",
"训练轮数": "2-4 epochs(过多容易过拟合)",
"最大序列长度": "128 或 256(根据任务调整)",
"优化器": "AdamW(带权重衰减的Adam)",
"学习率调度": "线性warmup + 线性衰减",
"Dropout": "保持BERT默认的0.1",
"梯度裁剪": "最大梯度范数1.0",
}
print("BERT微调最佳实践:")
for key, value in best_practices.items():
print(f" 📌 {key}: {value}")
7. 多标签分类
class MultiLabelClassifier(nn.Module):
"""多标签文本分类"""
def __init__(self, vocab_size, embedding_dim, hidden_dim, num_labels):
super().__init__()
self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True, bidirectional=True)
self.fc = nn.Linear(hidden_dim * 2, num_labels)
def forward(self, x):
embedded = self.embedding(x)
output, _ = self.lstm(embedded)
pooled = output.mean(dim=1)
logits = self.fc(pooled)
return logits # 注意:不用softmax,用sigmoid
# 多标签分类使用BCEWithLogitsLoss
# criterion = nn.BCEWithLogitsLoss()
# 示例:一篇新闻可能同时属于"科技"和"财经"
multi_label_example = {
"文本": "苹果公司发布了新款iPhone,股价应声上涨5%",
"标签": {"科技": 1, "财经": 1, "体育": 0, "娱乐": 0},
}
print(f"多标签示例: {multi_label_example}")
8. 情感分析实战
8.1 完整的情感分析项目
"""
情感分析实战:中文电影评论情感分类
使用多种方法进行对比
"""
import numpy as np
from collections import Counter
import jieba
import re
# ==================
# 数据准备
# ==================
# 模拟数据集
positive_reviews = [
"非常好看的一部电影,剧情紧凑,演技在线",
"强烈推荐这部电影,看完让人深思",
"画面精美,导演功力深厚,五星好评",
"这是今年看过最好的电影,值得二刷",
"笑中带泪,很感人的一部作品",
"特效太震撼了,值得去IMAX看",
"演员的演技都很棒,剧情也很吸引人",
"编剧太厉害了,每个角色都立住了",
"看完心情大好,强烈推荐给朋友们",
"国产电影的骄傲,必须支持",
"剧情出人意料,结局很惊艳",
"配乐非常好听,为电影增色不少",
]
negative_reviews = [
"太难看了,浪费两个小时",
"剧情老套,毫无新意",
"演技尴尬,看不下去",
"特效五毛钱,剧情一塌糊涂",
"这是我看过最差的电影",
"导演在想什么,完全不知所云",
"无聊至极,全程想睡觉",
"烂片预警,千万别买票",
"浪费钱浪费时间,后悔看了",
"演员演技太差了,剧情也不行",
"失望透顶,还不如在家看电视",
"逻辑混乱剧情崩塌的烂片",
]
all_texts = positive_reviews + negative_reviews
all_labels = [1] * len(positive_reviews) + [0] * len(negative_reviews)
# 文本预处理
def preprocess(text):
text = re.sub(r'[^\u4e00-\u9fa5a-zA-Z0-9]', '', text)
words = jieba.lcut(text)
return words
tokenized_texts = [preprocess(t) for t in all_texts]
# 划分训练集和测试集
np.random.seed(42)
indices = np.random.permutation(len(all_texts))
split = int(0.8 * len(indices))
train_idx, test_idx = indices[:split], indices[split:]
X_train = [tokenized_texts[i] for i in train_idx]
y_train = [all_labels[i] for i in train_idx]
X_test = [tokenized_texts[i] for i in test_idx]
y_test = [all_labels[i] for i in test_idx]
# ==================
# 方法1: 朴素贝叶斯
# ==================
nb_clf = MultinomialNaiveBayes(alpha=1.0)
nb_clf.fit(X_train, y_train)
nb_preds = nb_clf.predict(X_test)
nb_acc = sum(p == t for p, t in zip(nb_preds, y_test)) / len(y_test)
print("="*50)
print("情感分析方法对比")
print("="*50)
print(f"\n1. 朴素贝叶斯准确率: {nb_acc:.4f}")
# ==================
# 方法2: TF-IDF + SVM (使用sklearn)
# ==================
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
train_texts_str = [' '.join(t) for t in X_train]
test_texts_str = [' '.join(t) for t in X_test]
vectorizer = TfidfVectorizer(max_features=3000)
X_train_tfidf = vectorizer.fit_transform(train_texts_str)
X_test_tfidf = vectorizer.transform(test_texts_str)
svm_clf = LinearSVC(C=1.0)
svm_clf.fit(X_train_tfidf, y_train)
svm_preds = svm_clf.predict(X_test_tfidf)
svm_acc = sum(p == t for p, t in zip(svm_preds, y_test)) / len(y_test)
print(f"2. TF-IDF + SVM准确率: {svm_acc:.4f}")
print(f"\n测试样本预测结果:")
for i, idx in enumerate(test_idx):
text = all_texts[idx]
true_label = "正面" if all_labels[idx] == 1 else "负面"
nb_label = "正面" if nb_preds[i] == 1 else "负面"
svm_label = "正面" if svm_preds[i] == 1 else "负面"
status = "✓" if nb_preds[i] == all_labels[idx] else "✗"
print(f" {status} '{text[:15]}...' 真实:{true_label} NB:{nb_label} SVM:{svm_label}") # 切片操作,取前n个元素
9. 面试要点
🔑 面试高频考点
考点1:TextCNN的原理是什么?为什么有效?
✅ 标准答案要点:
1. 架构:嵌入层 → 多尺度卷积 → 最大池化 → 全连接 → Softmax
2. 为什么有效:
- 不同大小的卷积核捕捉不同长度的n-gram特征
- Max Pooling捕捉最重要的特征,忽略位置信息
- 参数量少,训练速度快
3. 关键超参数:
- 卷积核大小:通常(2,3,4)或(3,4,5)
- 卷积核数量:100-300
- Dropout:0.5
4. 局限性:
- 只能捕捉局部特征(卷积核范围内)
- 不适合长距离依赖
考点2:BERT微调文本分类的流程?
✅ 标准答案要点:
1. 输入处理:[CLS] + tokens + [SEP],WordPiece分词
2. 使用[CLS]对应的输出向量作为文本表示
3. 在[CLS]向量上加一个线性分类头
4. 微调整个BERT + 分类头(端到端)
5. 关键超参数:
- 学习率:2e-5~5e-5
- Batch Size:16/32
- Epochs:2-4
- Warmup比例:10%
6. 注意事项:
- 学习率不能太大(破坏预训练参数)
- 不需要太多epoch(防止过拟合)
考点3:文本分类中如何处理类别不均衡?
✅ 标准答案要点:
数据层面:
1. 过采样少类(SMOTE、随机过采样)
2. 欠采样多类(随机删除)
3. 数据增强(同义词替换、回译)
模型层面:
1. 类别权重:在loss中设置class_weight
2. Focal Loss:聚焦难分类样本(示意实现见本考点末尾)
3. 代价敏感学习
评估层面:
1. 使用Macro F1而非Accuracy
2. 关注少类的Recall
3. 使用AUC-ROC
实战经验:
- 倾斜比例不大(<10:1):类别权重即可
- 严重不均衡(>100:1):考虑将问题转为异常检测
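类别权重和 Focal Loss 的最小示意如下(二分类场景,权重取值和 gamma 均为演示用的假设):
import torch
import torch.nn as nn
import torch.nn.functional as F
# 方式1:类别权重,给少数类更大的损失权重(此处假设类别1是少数类)
weighted_ce = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 5.0]))
# 方式2:Focal Loss,降低易分样本的损失贡献,聚焦难样本
class FocalLoss(nn.Module):
    def __init__(self, gamma=2.0, weight=None):
        super().__init__()
        self.gamma = gamma
        self.weight = weight
    def forward(self, logits, targets):
        ce = F.cross_entropy(logits, targets, weight=self.weight, reduction='none')
        pt = torch.exp(-ce)                          # pt是模型对真实类别的预测概率
        return ((1 - pt) ** self.gamma * ce).mean()  # (1-pt)^gamma 抑制易分样本
logits = torch.randn(4, 2)
targets = torch.tensor([0, 1, 1, 0])
print(weighted_ce(logits, targets).item(), FocalLoss(gamma=2.0)(logits, targets).item())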
考点4:如何提升文本分类的效果?
✅ 标准答案要点:
数据层面:
- 增加标注数据
- 数据增强(EDA、回译、对抗样本)
- 数据清洗和一致性检查
模型层面:
- 使用预训练模型(BERT/RoBERTa)
- 模型集成(投票、Stacking)
- 领域预训练(在领域数据上继续预训练)
训练技巧:
- 学习率warmup和衰减
- 标签平滑(Label Smoothing,用法示例见本考点末尾)
- 对抗训练(FGM、PGD)
- R-Drop正则化
特征层面:
- 引入额外特征(文本长度、特殊词频等)
- 多任务学习
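其中标签平滑在 PyTorch 1.10+ 中可以直接通过损失函数参数开启,示意如下(平滑系数0.1为常用的演示取值):
import torch.nn as nn
# 标签平滑:把one-hot目标软化,防止模型对训练标签过度自信
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)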
10. 练习题
📝 基础题
- 解释朴素贝叶斯中"朴素"的含义,以及这个假设在文本分类中是否合理。
答案:"朴素"指条件独立性假设——给定类别下各特征(词)相互独立,即P(w1,w2,...|c)=P(w1|c)·P(w2|c)·...。这在文本中不完全合理,因为词间存在强依赖(如"人工"与"智能"常共现)。但实际效果好:①分类只需比较P(c|x)的相对大小,误差对各类别影响类似;②高维稀疏空间中相关性影响被稀释;③训练快、不易过拟合,在小数据集上表现稳健。
- 比较TextCNN和TextRNN在文本分类中的优缺点。
答案:TextCNN优点:并行计算速度快;多尺度卷积核能捕捉不同长度的局部n-gram特征;实现简单。缺点:只能捕捉局部特征,难以建模长距离依赖。TextRNN(LSTM/GRU)优点:能建模长距离依赖和序列关系;对词序敏感,适合需要上下文理解的任务。缺点:串行计算训练速度慢;长文本仍有梯度问题;模型较复杂。实践建议:短文本分类优选TextCNN,长文本或需上下文理解的任务选BiLSTM。
💻 编程题
- 使用Scikit-learn实现一个完整的多类别新闻分类器(体育、科技、财经、娱乐),使用TF-IDF + 多种分类器对比。
- 实现一个TextCNN模型,在中文情感分析数据集上训练并评估。
- 使用Hugging Face的BERT模型完成一个意图识别任务。
🔬 思考题
- 在大模型时代,是否还需要单独训练文本分类模型?什么情况下应该使用BERT微调而不是直接用GPT?
答案:仍需要。用BERT微调的场景:①高并发/低延迟:BERT-base仅110M参数,推理远快于GPT级大模型;②数据敏感:企业数据不能调用外部API;③垂直领域高精度:微调模型精度更高;④成本敏感:大模型API批量分类不经济。用GPT更好的场景:①无标注数据时零样本分类;②多任务统一处理;③快速原型验证无需训练。
✅ 自我检查清单
□ 我理解朴素贝叶斯在文本分类中的原理
□ 我能手写TextCNN的PyTorch代码
□ 我知道BERT微调分类的完整流程和关键超参数
□ 我能区分多标签和多分类问题
□ 我知道如何处理类别不均衡问题
□ 我完成了情感分析实战代码
□ 我完成了至少3道练习题
📚 延伸阅读
- TextCNN原始论文: Convolutional Neural Networks for Sentence Classification
- BERT论文: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- Hugging Face文本分类教程
- 中文情感分析综述
下一篇:05-序列标注 — 命名实体识别与词性标注