📖 Chapter 4: Text Classification

Study time: 8 hours · Difficulty: ⭐⭐⭐ · Prerequisites: text representation, machine-learning basics, PyTorch basics · Goal: master text-classification methods from Naive Bayes to BERT and complete a hands-on sentiment-analysis project




1. Overview of Text Classification

1.1 What Is Text Classification?

Text classification is the most fundamental and common task in NLP: given a piece of text, assign it to one of a set of predefined categories.

Python
text_classification_tasks = {
    "sentiment analysis": {
        "input": "这家餐厅的菜品非常好吃,服务也很周到",
        "output": "positive",
        "use case": "e-commerce reviews, social-media analysis",
    },
    "news categorization": {
        "input": "中国队在世界杯预选赛中以3:1击败对手",
        "output": "sports",
        "use case": "news recommendation, content distribution",
    },
    "intent recognition": {
        "input": "帮我查一下明天北京的天气",
        "output": "weather query",
        "use case": "voice assistants, customer-service bots",
    },
    "spam detection": {
        "input": "恭喜您中了100万大奖,点击领取",
        "output": "spam",
        "use case": "email filtering, content moderation",
    },
    "emotion recognition": {
        "input": "今天被老板骂了,心情很糟糕",
        "output": "anger/sadness",
        "use case": "mental-health monitoring",
    },
}

1.2 The Evolution of Text-Classification Techniques

Text Only
Traditional methods        Deep learning              Pre-trained models
(2000s)                    (2014-2017)                (2018-present)

Naive Bayes    ──→        TextCNN        ──→        BERT fine-tuning
SVM            ──→        LSTM/BiLSTM    ──→        RoBERTa
Logistic regression ──→   Attention      ──→        GPT prompting
Random forest             RCNN                       LLM zero-shot

Features: TF-IDF           Features: Word2Vec/GloVe   Features: pretrained representations

1.3 The Text-Classification Pipeline

Text Only
Input text
┌──────────────────┐
│  Preprocessing    │ cleaning, tokenization, encoding
└──────────────────┘
┌──────────────────┐
│  Feature extraction │ TF-IDF / word vectors / BERT encoding
└──────────────────┘
┌──────────────────┐
│  Model training   │ NB / SVM / CNN / LSTM / BERT
└──────────────────┘
┌──────────────────┐
│  Evaluation       │ Accuracy / F1 / AUC
└──────────────────┘
┌──────────────────┐
│  Deployment       │ API / online service
└──────────────────┘
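The stages above can be sketched as a chain of plain functions. The sketch below is purely illustrative: the whitespace tokenizer, the bag-of-words features, and the keyword lexicon (`POSITIVE_WORDS`) are toy stand-ins for the real components named in each box:

```python
import re

# Stage 1: preprocessing — strip punctuation, lowercase, whitespace-tokenize
def preprocess(text):
    return re.sub(r"[^\w\s]", "", text).lower().split()

# Stage 2: feature extraction — bag-of-words counts (stand-in for TF-IDF / BERT)
def extract_features(tokens):
    feats = {}
    for t in tokens:
        feats[t] = feats.get(t, 0) + 1
    return feats

# Stages 3-4: a trivial keyword "model" standing in for NB / SVM / CNN / BERT
POSITIVE_WORDS = {"good", "great", "recommend"}  # illustrative lexicon

def predict(feats):
    score = sum(c for w, c in feats.items() if w in POSITIVE_WORDS)
    return "positive" if score > 0 else "negative"

# Stage 5: the deployed entry point chains the stages together
def classify(text):
    return predict(extract_features(preprocess(text)))

print(classify("Great movie, highly recommend!"))  # positive
print(classify("Boring and slow."))                # negative
```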

2. Naive Bayes Text Classification

2.1 Principle

Classification is based on Bayes' theorem together with the conditional-independence ("naive") assumption over features:

\[P(c|d) = \frac{P(d|c) \cdot P(c)}{P(d)} \propto P(c) \prod_{i=1}^{n} P(w_i|c)\]
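One practical note: multiplying many per-word probabilities underflows 64-bit floats almost immediately, which is why implementations (including the one below) sum log probabilities instead. A quick check, using an illustrative per-word likelihood of 1e-5:

```python
import math

p_word = 1e-5   # illustrative per-word likelihood P(w_i|c)
n_words = 100   # a 100-word document

# Naive product: underflows to exactly 0.0 in 64-bit floats
prod = 1.0
for _ in range(n_words):
    prod *= p_word
print(prod)  # 0.0

# Summing logs stays comfortably within floating-point range
log_score = n_words * math.log(p_word)
print(log_score)  # about -1151.3
```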
Python
import numpy as np
from collections import Counter, defaultdict
import math

class MultinomialNaiveBayes:
    """Multinomial Naive Bayes text classifier"""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing parameter
        self.class_log_prior = {}
        self.feature_log_prob = {}
        self.classes = []
        self.vocab = set()

    def fit(self, X, y):
        """
        X: list of tokenized documents, e.g. [["word1", "word2"], ...]
        y: list of labels
        """
        self.classes = list(set(y))
        n_docs = len(X)

        # Build the vocabulary
        for doc in X:
            self.vocab.update(doc)
        vocab_size = len(self.vocab)
        self.word2idx = {w: i for i, w in enumerate(sorted(self.vocab))}  # enumerate yields (index, item) pairs

        for c in self.classes:
            # Prior probability P(c)
            c_docs = [X[i] for i in range(n_docs) if y[i] == c]
            self.class_log_prior[c] = math.log(len(c_docs) / n_docs)

            # Likelihood P(w|c)
            word_counts = Counter()  # Counter tallies element frequencies
            total_words = 0
            for doc in c_docs:
                word_counts.update(doc)
                total_words += len(doc)

            # Laplace smoothing
            self.feature_log_prob[c] = {}
            for word in self.vocab:
                count = word_counts.get(word, 0)
                self.feature_log_prob[c][word] = math.log(
                    (count + self.alpha) / (total_words + self.alpha * vocab_size)
                )

    def predict(self, X):
        return [self._predict_single(doc) for doc in X]

    def _predict_single(self, doc):
        scores = {}
        for c in self.classes:
            score = self.class_log_prior[c]
            for word in doc:
                if word in self.feature_log_prob[c]:
                    score += self.feature_log_prob[c][word]
            scores[c] = score
        return max(scores, key=scores.get)

    def predict_proba(self, doc):
        """返回各类别概率"""
        scores = {}
        for c in self.classes:
            score = self.class_log_prior[c]
            for word in doc:
                if word in self.feature_log_prob[c]:
                    score += self.feature_log_prob[c][word]
            scores[c] = score

        # Softmax normalization (shift by the max score for numerical stability)
        max_score = max(scores.values())
        exp_scores = {c: math.exp(s - max_score) for c, s in scores.items()}
        total = sum(exp_scores.values())
        return {c: s / total for c, s in exp_scores.items()}

# Train and evaluate
import jieba

train_data = [
    ("这部电影太好看了强烈推荐", "正面"),
    ("非常喜欢剧情很精彩", "正面"),
    ("拍得真好演技在线", "正面"),
    ("画面精美值得一看", "正面"),
    ("好看推荐大家去看", "正面"),
    ("很感动的一部电影", "正面"),
    ("太难看了浪费时间", "负面"),
    ("剧情太差不推荐", "负面"),
    ("很无聊看了一半就关了", "负面"),
    ("垃圾电影浪费钱", "负面"),
    ("太差了别看", "负面"),
    ("无聊透顶完全看不下去", "负面"),
]

X_train = [jieba.lcut(text) for text, _ in train_data]
y_train = [label for _, label in train_data]

nb = MultinomialNaiveBayes(alpha=1.0)
nb.fit(X_train, y_train)

# Test
test_texts = [
    "这个电影很好看",
    "太无聊了不推荐",
    "剧情精彩强烈推荐",
    "浪费时间垃圾电影",
]

for text in test_texts:
    words = jieba.lcut(text)
    pred = nb.predict([words])[0]  # use the public API rather than the private helper
    proba = nb.predict_proba(words)
    print(f"'{text}' → {pred} (正面:{proba.get('正面',0):.3f}, 负面:{proba.get('负面',0):.3f})")

3. SVM Text Classification

3.1 SVM for Text Classification

Python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
import jieba

# Prepare a slightly larger dataset
train_texts = [
    "这部电影太好看了", "非常喜欢剧情精彩", "演技出色推荐",
    "画面精美值得观看", "导演功力深厚", "好看感人至深",
    "太难看了浪费时间", "剧情很差不推荐", "很无聊看不下去",
    "垃圾电影别看", "太差了失望透顶", "无聊至极",
    "还行一般般吧", "中规中矩没特色", "普普通通",
]
train_labels = ["正面"]*6 + ["负面"]*6 + ["中性"]*3

# Segment Chinese text and join tokens with spaces (sklearn expects pre-tokenized strings)
train_texts_seg = [' '.join(jieba.lcut(t)) for t in train_texts]

# TF-IDF features
vectorizer = TfidfVectorizer(max_features=5000)
X_train = vectorizer.fit_transform(train_texts_seg)

# Train the SVM
svm = LinearSVC(C=1.0, max_iter=1000)
svm.fit(X_train, train_labels)

# Test
test_texts = ["好看推荐", "太差了", "还可以吧"]
test_texts_seg = [' '.join(jieba.lcut(t)) for t in test_texts]
X_test = vectorizer.transform(test_texts_seg)
predictions = svm.predict(X_test)

for text, pred in zip(test_texts, predictions):  # zip pairs items by position
    print(f"'{text}' → {pred}")

4. TextCNN

4.1 TextCNN Architecture

TextCNN, proposed by Kim (2014), is the classic application of CNNs to text classification.

Text Only
Input text: "自然 语言 处理 技术 很 有趣"
┌────────────────────────────────┐
│  Embedding layer               │  each word → d-dim vector
│  [v_自然 v_语言 v_处理 v_技术 v_很 v_有趣]  │
└────────────────────────────────┘
┌────────────────────────────────┐
│  Multi-scale convolutions      │
│  ┌──────┐ ┌──────┐ ┌──────┐  │
│  │size=2│ │size=3│ │size=4│  │  kernels of different widths
│  │n=100 │ │n=100 │ │n=100 │  │  capture n-grams of different lengths
│  └──────┘ └──────┘ └──────┘  │
└────────────────────────────────┘
┌────────────────────────────────┐
│  Max-over-time pooling         │  max over each feature map
│  → 100 + 100 + 100 = 300 dims  │
└────────────────────────────────┘
┌────────────────────────────────┐
│  Fully connected + Softmax     │  class probabilities
│  300 → num_classes             │
└────────────────────────────────┘

4.2 Implementing TextCNN in PyTorch

Python
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np

class TextCNN(nn.Module):  # subclass nn.Module to define the network
    """TextCNN text-classification model"""

    def __init__(self, vocab_size, embedding_dim, num_classes,
                 num_filters=100, filter_sizes=(2, 3, 4), dropout=0.5,
                 pretrained_embeddings=None):
        super(TextCNN, self).__init__()

        # Embedding layer
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        if pretrained_embeddings is not None:
            self.embedding.weight.data.copy_(torch.FloatTensor(pretrained_embeddings))
            self.embedding.weight.requires_grad = True  # set False to freeze the pretrained vectors

        # Multi-scale convolution layers
        self.convs = nn.ModuleList([
            nn.Conv1d(embedding_dim, num_filters, kernel_size=fs)
            for fs in filter_sizes
        ])

        # Dropout
        self.dropout = nn.Dropout(dropout)

        # Fully connected layer
        self.fc = nn.Linear(len(filter_sizes) * num_filters, num_classes)

    def forward(self, x):
        """
        x: (batch_size, seq_len)
        """
        # Embedding: (batch_size, seq_len, embedding_dim)
        embedded = self.embedding(x)

        # Permute to fit Conv1d: (batch_size, embedding_dim, seq_len)
        embedded = embedded.permute(0, 2, 1)

        # Multi-scale convolution + pooling
        conv_outputs = []
        for conv in self.convs:
            # Convolution: (batch_size, num_filters, seq_len - kernel_size + 1)
            c = F.relu(conv(embedded))  # F.xxx is PyTorch's functional API
            # Max pooling over time: (batch_size, num_filters)
            c = F.max_pool1d(c, c.size(2)).squeeze(2)  # squeeze drops the size-1 dimension
            conv_outputs.append(c)

        # Concatenate: (batch_size, num_filters * len(filter_sizes))
        out = torch.cat(conv_outputs, dim=1)  # torch.cat joins tensors along an existing dimension

        # Dropout + fully connected
        out = self.dropout(out)
        out = self.fc(out)

        return out

# ==================
# Data preparation and training
# ==================

class TextDataset:
    """文本数据集"""

    def __init__(self, texts, labels, word2idx, max_len=50):
        self.texts = texts
        self.labels = labels
        self.word2idx = word2idx
        self.max_len = max_len

    def __len__(self):  # __len__ defines the behavior of len()
        return len(self.texts)

    def encode(self, words):
        """Convert a word list to index IDs"""
        indices = [self.word2idx.get(w, 1) for w in words]  # 1 = UNK
        # Pad or truncate
        if len(indices) < self.max_len:
            indices += [0] * (self.max_len - len(indices))  # 0 = PAD
        else:
            indices = indices[:self.max_len]
        return indices

    def get_batch(self, batch_size=32):
        """Sample a random batch"""
        indices = np.random.choice(len(self.texts), batch_size, replace=True)
        batch_x = [self.encode(self.texts[i]) for i in indices]
        batch_y = [self.labels[i] for i in indices]
        return torch.LongTensor(batch_x), torch.LongTensor(batch_y)

# Prepare data
import jieba

train_data = [
    ("这部电影太好看了", 1), ("非常喜欢推荐", 1), ("精彩绝伦好评", 1),
    ("画面很美剧情打动人", 1), ("演技出色故事感人", 1), ("好看强烈推荐", 1),
    ("太难看了不推荐", 0), ("剧情太差无聊", 0), ("垃圾电影别看", 0),
    ("浪费时间很差", 0), ("失望透顶太烂了", 0), ("看不下去", 0),
] * 10  # repeat to enlarge the toy dataset

texts_tokenized = [jieba.lcut(text) for text, _ in train_data]
labels = [label for _, label in train_data]

# Build the vocabulary
vocab = set()
for tokens in texts_tokenized:
    vocab.update(tokens)
word2idx = {"<PAD>": 0, "<UNK>": 1}
for i, word in enumerate(sorted(vocab), start=2):
    word2idx[word] = i

vocab_size = len(word2idx)

# Create the dataset
dataset = TextDataset(texts_tokenized, labels, word2idx, max_len=20)

# Create the model
model = TextCNN(
    vocab_size=vocab_size,
    embedding_dim=64,
    num_classes=2,
    num_filters=32,
    filter_sizes=(2, 3, 4),
    dropout=0.3,
)

# Training
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

model.train()  # switch to training mode (dropout active)
for epoch in range(20):
    batch_x, batch_y = dataset.get_batch(32)
    output = model(batch_x)
    loss = criterion(output, batch_y)

    optimizer.zero_grad()  # clear accumulated gradients
    loss.backward()  # backpropagate to compute gradients
    optimizer.step()  # update the parameters

    if (epoch + 1) % 5 == 0:
        pred = output.argmax(dim=1)
        acc = (pred == batch_y).float().mean()
        print(f"Epoch {epoch+1}, Loss: {loss.item():.4f}, Acc: {acc.item():.4f}")  # .item() converts a one-element tensor to a Python number

# Prediction
model.eval()
test_texts = ["好看推荐大家看", "太差了别浪费时间"]
for text in test_texts:
    words = jieba.lcut(text)
    indices = dataset.encode(words)
    x = torch.LongTensor([indices])
    with torch.no_grad():  # disable gradient tracking to save memory
        output = model(x)
        pred = output.argmax(dim=1).item()
        prob = F.softmax(output, dim=1)[0]
    sentiment = "positive" if pred == 1 else "negative"
    print(f"'{text}' → {sentiment} (P(positive): {prob[1]:.3f})")

5. RNN/LSTM Text Classification

5.1 LSTM Text Classification

Python
class LSTMClassifier(nn.Module):
    """LSTM文本分类模型"""

    def __init__(self, vocab_size, embedding_dim, hidden_dim,
                 num_classes, num_layers=2, bidirectional=True, dropout=0.5):
        super(LSTMClassifier, self).__init__()

        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)

        self.lstm = nn.LSTM(
            embedding_dim, hidden_dim,
            num_layers=num_layers,
            bidirectional=bidirectional,
            batch_first=True,
            dropout=dropout if num_layers > 1 else 0,
        )

        self.dropout = nn.Dropout(dropout)

        # A bidirectional LSTM doubles the output dimension
        fc_dim = hidden_dim * 2 if bidirectional else hidden_dim
        self.fc = nn.Linear(fc_dim, num_classes)

    def forward(self, x):
        # x: (batch_size, seq_len)
        embedded = self.embedding(x)  # (batch_size, seq_len, embedding_dim)

        # LSTM
        output, (hidden, cell) = self.lstm(embedded)
        # output: (batch_size, seq_len, hidden_dim * 2)

        # Option 1: take the last time step (simple but effective)
        # Option 2: mean pooling
        # Option 3: attention pooling

        # Mean pooling is used here
        pooled = output.mean(dim=1)  # (batch_size, hidden_dim * 2)

        out = self.dropout(pooled)
        out = self.fc(out)
        return out

# Build a BiLSTM model
lstm_model = LSTMClassifier(
    vocab_size=vocab_size,
    embedding_dim=64,
    hidden_dim=128,
    num_classes=2,
    num_layers=2,
    bidirectional=True,
    dropout=0.3,
)

print(f"模型参数量: {sum(p.numel() for p in lstm_model.parameters()):,}")
print(lstm_model)

5.2 LSTM + Attention

Python
class AttentionLayer(nn.Module):
    """简单的注意力机制"""

    def __init__(self, hidden_dim):
        super(AttentionLayer, self).__init__()
        self.attention = nn.Linear(hidden_dim, 1)

    def forward(self, lstm_output):
        # lstm_output: (batch_size, seq_len, hidden_dim)
        attention_weights = torch.softmax(
            self.attention(lstm_output).squeeze(-1), dim=1
        )
        # (batch_size, seq_len)

        # Weighted sum
        weighted = torch.bmm(
            attention_weights.unsqueeze(1), lstm_output  # unsqueeze adds a size-1 dimension
        ).squeeze(1)
        # (batch_size, hidden_dim)

        return weighted, attention_weights

class LSTMAttentionClassifier(nn.Module):
    """LSTM + Attention 文本分类"""

    def __init__(self, vocab_size, embedding_dim, hidden_dim,
                 num_classes, bidirectional=True, dropout=0.5):
        super().__init__()  # super() delegates to the parent class

        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True,
                           bidirectional=bidirectional)

        fc_dim = hidden_dim * 2 if bidirectional else hidden_dim
        self.attention = AttentionLayer(fc_dim)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(fc_dim, num_classes)

    def forward(self, x):
        embedded = self.embedding(x)
        lstm_out, _ = self.lstm(embedded)
        attn_out, attn_weights = self.attention(lstm_out)
        out = self.dropout(attn_out)
        out = self.fc(out)
        return out, attn_weights

print("LSTM + Attention模型创建成功")

6. BERT Text Classification

6.1 How BERT Fine-Tuning Works

Text Only
BERT fine-tuning flow:

Input: [CLS] 这 部 电 影 很 好 看 [SEP]
┌──────────────────────┐
│  BERT encoder        │  12 Transformer layers
│  (pretrained weights)│
└──────────────────────┘
   output vector at [CLS] (768-dim)
┌──────────────────────┐
│  Classification head │  Linear(768, num_classes)
│  (randomly initialized)│
└──────────────────────┘
   class probabilities [0.9, 0.1] → positive

6.2 BERT Classification with Hugging Face

Python
# pip install transformers datasets

from transformers import BertTokenizer, BertForSequenceClassification
from transformers import Trainer, TrainingArguments
import torch

# ==================
# 1. Load the pretrained model and tokenizer
# ==================

model_name = "bert-base-chinese"  # Chinese BERT

# Load the tokenizer
tokenizer = BertTokenizer.from_pretrained(model_name)

# Tokenization example
text = "这部电影太好看了"
tokens = tokenizer(text, padding='max_length', truncation=True,
                   max_length=32, return_tensors='pt')
print(f"Input IDs: {tokens['input_ids']}")
print(f"Attention Mask: {tokens['attention_mask']}")
print(f"Tokens: {tokenizer.convert_ids_to_tokens(tokens['input_ids'][0])}")

# ==================
# 2. Build the dataset
# ==================

class SentimentDataset(torch.utils.data.Dataset):
    def __init__(self, texts, labels, tokenizer, max_len=128):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):  # __getitem__ defines index access
        text = self.texts[idx]
        label = self.labels[idx]

        encoding = self.tokenizer(
            text,
            padding='max_length',
            truncation=True,
            max_length=self.max_len,
            return_tensors='pt',
        )

        return {
            'input_ids': encoding['input_ids'].squeeze(),
            'attention_mask': encoding['attention_mask'].squeeze(),
            'labels': torch.tensor(label, dtype=torch.long),
        }

# Prepare data
train_texts = [
    "这部电影太好看了", "非常喜欢推荐", "精彩绝伦",
    "画面很美", "演技出色", "好看感人",
    "太难看了", "剧情太差", "垃圾电影",
    "浪费时间", "很失望", "看不下去",
]
train_labels = [1]*6 + [0]*6

train_dataset = SentimentDataset(train_texts, train_labels, tokenizer)

# ==================
# 3. Load the BERT model
# ==================

model = BertForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2,
)

# ==================
# 4. Training configuration and loop
# ==================

# A simple training loop (without the Trainer API)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for epoch in range(3):
    total_loss = 0
    for i in range(0, len(train_texts), 4):
        batch_texts = train_texts[i:i+4]
        batch_labels = train_labels[i:i+4]

        encoding = tokenizer(
            batch_texts,
            padding=True,
            truncation=True,
            max_length=64,
            return_tensors='pt',
        )
        encoding['labels'] = torch.tensor(batch_labels)

        outputs = model(**encoding)
        loss = outputs.loss
        total_loss += loss.item()

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print(f"Epoch {epoch+1}, Loss: {total_loss:.4f}")

# ==================
# 5. Prediction
# ==================

model.eval()
test_texts = ["这个电影很好看强烈推荐", "太差了别浪费时间"]

for text in test_texts:
    encoding = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=64)
    with torch.no_grad():
        outputs = model(**encoding)
        logits = outputs.logits
        pred = logits.argmax(dim=1).item()
        probs = torch.softmax(logits, dim=1)[0]

    sentiment = "正面" if pred == 1 else "负面"
    print(f"'{text}' → {sentiment} (正面概率: {probs[1]:.3f})")

6.3 Best Practices for BERT Fine-Tuning

Python
# Key hyperparameters for BERT fine-tuning
best_practices = {
    "Learning rate": "2e-5 to 5e-5 (much smaller than usual for deep learning)",
    "Batch size": "16 or 32 (as large as GPU memory allows)",
    "Epochs": "2-4 (more tends to overfit)",
    "Max sequence length": "128 or 256 (task dependent)",
    "Optimizer": "AdamW (Adam with decoupled weight decay)",
    "LR schedule": "linear warmup + linear decay",
    "Dropout": "keep BERT's default of 0.1",
    "Gradient clipping": "max gradient norm of 1.0",
}

print("BERT fine-tuning best practices:")
for key, value in best_practices.items():
    print(f"  📌 {key}: {value}")

7. Multi-Label Classification

Python
class MultiLabelClassifier(nn.Module):
    """多标签文本分类"""

    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_labels):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden_dim * 2, num_labels)

    def forward(self, x):
        embedded = self.embedding(x)
        output, _ = self.lstm(embedded)
        pooled = output.mean(dim=1)
        logits = self.fc(pooled)
        return logits  # note: decode with a per-label sigmoid, not softmax

# Multi-label classification uses BCEWithLogitsLoss
# criterion = nn.BCEWithLogitsLoss()

# Example: one news article can belong to both "tech" and "finance"
multi_label_example = {
    "text": "苹果公司发布了新款iPhone,股价应声上涨5%",
    "labels": {"科技": 1, "财经": 1, "体育": 0, "娱乐": 0},
}
print(f"Multi-label example: {multi_label_example}")

8. Hands-On Sentiment Analysis

8.1 A Complete Sentiment-Analysis Project

Python
"""
情感分析实战:中文电影评论情感分类
使用多种方法进行对比
"""

import numpy as np
from collections import Counter
import jieba
import re

# ==================
# Data preparation
# ==================

# A simulated dataset
positive_reviews = [
    "非常好看的一部电影,剧情紧凑,演技在线",
    "强烈推荐这部电影,看完让人深思",
    "画面精美,导演功力深厚,五星好评",
    "这是今年看过最好的电影,值得二刷",
    "笑中带泪,很感人的一部作品",
    "特效太震撼了,值得去IMAX看",
    "演员的演技都很棒,剧情也很吸引人",
    "编剧太厉害了,每个角色都立住了",
    "看完心情大好,强烈推荐给朋友们",
    "国产电影的骄傲,必须支持",
    "剧情出人意料,结局很惊艳",
    "配乐非常好听,为电影增色不少",
]

negative_reviews = [
    "太难看了,浪费两个小时",
    "剧情老套,毫无新意",
    "演技尴尬,看不下去",
    "特效五毛钱,剧情一塌糊涂",
    "这是我看过最差的电影",
    "导演在想什么,完全不知所云",
    "无聊至极,全程想睡觉",
    "烂片预警,千万别买票",
    "浪费钱浪费时间,后悔看了",
    "演员演技太差了,剧情也不行",
    "失望透顶,还不如在家看电视",
    "逻辑混乱剧情崩塌的烂片",
]

all_texts = positive_reviews + negative_reviews
all_labels = [1] * len(positive_reviews) + [0] * len(negative_reviews)

# Text preprocessing
def preprocess(text):
    text = re.sub(r'[^\u4e00-\u9fa5a-zA-Z0-9]', '', text)
    words = jieba.lcut(text)
    return words

tokenized_texts = [preprocess(t) for t in all_texts]

# Train/test split
np.random.seed(42)
indices = np.random.permutation(len(all_texts))
split = int(0.8 * len(indices))
train_idx, test_idx = indices[:split], indices[split:]

X_train = [tokenized_texts[i] for i in train_idx]
y_train = [all_labels[i] for i in train_idx]
X_test = [tokenized_texts[i] for i in test_idx]
y_test = [all_labels[i] for i in test_idx]

# ==================
# Method 1: Naive Bayes
# ==================

nb_clf = MultinomialNaiveBayes(alpha=1.0)
nb_clf.fit(X_train, y_train)
nb_preds = nb_clf.predict(X_test)
nb_acc = sum(p == t for p, t in zip(nb_preds, y_test)) / len(y_test)

print("="*50)
print("情感分析方法对比")
print("="*50)
print(f"\n1. 朴素贝叶斯准确率: {nb_acc:.4f}")

# ==================
# Method 2: TF-IDF + SVM (scikit-learn)
# ==================

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

train_texts_str = [' '.join(t) for t in X_train]
test_texts_str = [' '.join(t) for t in X_test]

vectorizer = TfidfVectorizer(max_features=3000)
X_train_tfidf = vectorizer.fit_transform(train_texts_str)
X_test_tfidf = vectorizer.transform(test_texts_str)

svm_clf = LinearSVC(C=1.0)
svm_clf.fit(X_train_tfidf, y_train)
svm_preds = svm_clf.predict(X_test_tfidf)
svm_acc = sum(p == t for p, t in zip(svm_preds, y_test)) / len(y_test)

print(f"2. TF-IDF + SVM准确率: {svm_acc:.4f}")

print(f"\n测试样本预测结果:")
for i, idx in enumerate(test_idx):
    text = all_texts[idx]
    true_label = "正面" if all_labels[idx] == 1 else "负面"
    nb_label = "正面" if nb_preds[i] == 1 else "负面"
    svm_label = "正面" if svm_preds[i] == 1 else "负面"
    status = "✓" if nb_preds[i] == all_labels[idx] else "✗"
    print(f"  {status} '{text[:15]}...' 真实:{true_label} NB:{nb_label} SVM:{svm_label}")  # 切片操作,取前n个元素

9. Interview Essentials

🔑 High-frequency interview topics

Topic 1: How does TextCNN work, and why is it effective?

Text Only
✅ Key points:
1. Architecture: embedding → multi-scale convolutions → max pooling → fully connected → Softmax
2. Why it works:
   - kernels of different widths capture n-gram features of different lengths
   - max pooling keeps the strongest feature and discards position information
   - few parameters, fast to train
3. Key hyperparameters:
   - kernel sizes: usually (2,3,4) or (3,4,5)
   - number of filters: 100-300
   - dropout: 0.5
4. Limitations:
   - captures only local features (within the kernel window)
   - ill-suited to long-range dependencies

Topic 2: What is the workflow for fine-tuning BERT for text classification?

Text Only
✅ Key points:
1. Input format: [CLS] + tokens + [SEP], WordPiece tokenization
2. Use the output vector at [CLS] as the text representation
3. Add a linear classification head on top of the [CLS] vector
4. Fine-tune all of BERT plus the head end to end
5. Key hyperparameters:
   - learning rate: 2e-5 to 5e-5
   - batch size: 16/32
   - epochs: 2-4
   - warmup ratio: 10%
6. Caveats:
   - too large a learning rate destroys the pretrained weights
   - too many epochs lead to overfitting

Topic 3: How do you handle class imbalance in text classification?

Text Only
✅ Key points:
Data level:
1. Oversample minority classes (SMOTE, random oversampling)
2. Undersample majority classes (random removal)
3. Data augmentation (synonym substitution, back-translation)

Model level:
1. Class weights: set class_weight in the loss
2. Focal Loss: focus training on hard examples
3. Cost-sensitive learning

Evaluation level:
1. Use Macro F1 instead of Accuracy
2. Watch recall on minority classes
3. Use AUC-ROC

Practical experience:
- mild skew (<10:1): class weights usually suffice
- severe imbalance (>100:1): consider reframing the problem as anomaly detection

Topic 4: How do you improve text-classification performance?

Text Only
✅ Key points:
Data level:
- collect more labeled data
- data augmentation (EDA, back-translation, adversarial examples)
- data cleaning and label-consistency checks

Model level:
- use pretrained models (BERT/RoBERTa)
- ensembles (voting, stacking)
- domain-adaptive pretraining (continue pretraining on in-domain data)

Training tricks:
- learning-rate warmup and decay
- label smoothing
- adversarial training (FGM, PGD)
- R-Drop regularization

Feature level:
- add auxiliary features (text length, special-word frequency, etc.)
- multi-task learning
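Label smoothing from the training tricks above replaces the one-hot target with a softened distribution: the true class keeps 1 - ε of the probability mass and ε is spread uniformly over all K classes. A minimal sketch:

```python
def smooth_labels(true_class, num_classes, epsilon=0.1):
    """Soft target: (1 - epsilon) on the true class, epsilon spread uniformly over all classes."""
    uniform = epsilon / num_classes
    target = [uniform] * num_classes
    target[true_class] += 1.0 - epsilon
    return target

# For a binary task with epsilon=0.1, the one-hot target [0, 1]
# becomes approximately [0.05, 0.95]
target = smooth_labels(true_class=1, num_classes=2, epsilon=0.1)
print(target)
```

Training against the softened target discourages the model from driving logits to extremes, which tends to improve calibration and generalization.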

10. Exercises

📝 Basics

  1. Explain what "naive" means in Naive Bayes, and whether the assumption is reasonable for text classification.

Answer: "Naive" refers to the conditional-independence assumption — given the class, the features (words) are mutually independent, i.e. P(w1,w2,...|c) = P(w1|c)·P(w2|c)·.... This does not fully hold for text, since words are strongly dependent (e.g. "人工" and "智能" frequently co-occur). In practice it still works well: (1) classification only compares the relative magnitudes of P(c|x), and the error affects all classes similarly; (2) in a high-dimensional sparse space the effect of correlations is diluted; (3) it trains fast, resists overfitting, and is robust on small datasets.

  2. Compare the strengths and weaknesses of TextCNN and TextRNN for text classification.

Answer: TextCNN strengths: parallel computation makes it fast; multi-scale kernels capture local n-gram features of different lengths; simple to implement. Weaknesses: captures only local features and struggles with long-range dependencies. TextRNN (LSTM/GRU) strengths: models long-range dependencies and sequential structure; sensitive to word order, suiting tasks that need context. Weaknesses: sequential computation makes training slow; gradients still degrade on long texts; the model is more complex. Practical advice: prefer TextCNN for short-text classification; prefer BiLSTM for long texts or context-heavy tasks.

💻 Coding

  1. Using scikit-learn, build a complete multi-class news classifier (sports, tech, finance, entertainment) with TF-IDF features, and compare several classifiers.

  2. Implement a TextCNN model, then train and evaluate it on a Chinese sentiment-analysis dataset.

  3. Build an intent-recognition model with a Hugging Face BERT model.

🔬 Discussion

  1. In the era of large language models, do we still need dedicated text-classification models? When should you fine-tune BERT instead of prompting GPT directly?

Answer: Yes. Fine-tune BERT when: (1) high concurrency / low latency — BERT-base has only ~110M parameters and is far faster at inference than a GPT-scale model; (2) data sensitivity — enterprise data cannot be sent to an external API; (3) high accuracy in a vertical domain — a fine-tuned model is usually more accurate; (4) cost — batch classification through a large-model API is uneconomical. Prefer GPT when: (1) zero-shot classification with no labeled data; (2) many tasks handled by one unified model; (3) rapid prototyping with no training step.


✅ Self-Check List

Text Only
□ I understand how Naive Bayes works for text classification
□ I can write TextCNN in PyTorch from scratch
□ I know the full BERT fine-tuning workflow and its key hyperparameters
□ I can distinguish multi-label from multi-class problems
□ I know how to handle class imbalance
□ I completed the hands-on sentiment-analysis code
□ I completed at least 3 exercises

📚 Further Reading

  1. TextCNN paper: Convolutional Neural Networks for Sentence Classification (Kim, 2014)
  2. BERT paper: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
  3. Hugging Face text-classification tutorial
  4. Surveys of Chinese sentiment analysis

Next: 05 Sequence Labeling — named entity recognition and part-of-speech tagging