14. Specialized Training for Code and Agent Capabilities

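⚠️ Timeliness note: this chapter covers frontier models, prices, and leaderboards, all of which change quickly between versions. Defer to the original papers, official release pages, and API documentation.

Core question: many current LLMs ship upgrades aimed specifically at coding and Agent use. How are they actually trained for those capabilities?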

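Contents

Specialized training for code capability
Specialized training for Agent capability
Tool-calling capability training
Reinforcing reasoning capability
Methods for building training data
Frequently asked interview questions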

1. Specialized Training for Code Capability

1.1 Overview of Code Capability Training

Text Only

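┌──────────────────────────────────────────────────────────────────────┐
│                 Overview of Code Capability Training                 │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Stage 1: Code pretraining                                           │
│  ├── Large-scale code corpora (GitHub, tutorials, Stack Overflow)    │
│  ├── Code share: 15-30% of the mix (e.g. LLaMA 3)                    │
│  ├── Goal: learn syntax, idioms, and common APIs                     │
│  └── Scale: trillions of code tokens                                 │
│                                                                      │
│  Stage 2: Code SFT                                                   │
│  ├── High-quality instruction-code pairs                             │
│  ├── Multi-language coverage (Python, Java, C++, Rust, ...)          │
│  ├── Task variety (completion, explanation, debugging, refactoring)  │
│  └── Scale: millions to tens of millions of samples                  │
│                                                                      │
│  Stage 3: Code RLHF/DPO                                              │
│  ├── Rewards derived from code-execution feedback                    │
│  ├── Test-case pass/fail used as the reward signal                   │
│  ├── Human-preference alignment (readability, security)              │
│  └── Iterative refinement                                            │
│                                                                      │
│  Stage 4: Code reinforcement learning (new paradigm in 2025)         │
│  ├── Programming problems generated automatically                    │
│  ├── Model attempts → unit tests verify → reward signal              │
│  ├── Large-scale RL training (DeepSeek-R1 style)                     │
│  └── Emergent abilities: complex algorithms, system design           │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘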

1.2 Building Code Pretraining Data

Python

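import ast
import hashlib
import re


class CodePretrainingDataPipeline:
    """
    Pipeline for building code pretraining data.
    """

    def __init__(self):
        # Illustrative source mix; real proportions differ from model to model
        self.sources = {
            "github": {
                "scale": "hundreds of millions of repositories",
                "processing": "deduplication, quality filtering, license screening",
                "share": "60%"
            },
            "code_tutorials": {
                "origin": "tutorial sites, blogs, documentation",
                "processing": "extract code blocks and pair them with their context",
                "share": "20%"
            },
            "stack_overflow": {
                "origin": "Q&A pairs, code snippets",
                "processing": "quality scoring, answer selection",
                "share": "10%"
            },
            "synthetic": {
                "origin": "generated by a strong model",
                "processing": "execution-based verification, quality filtering",
                "share": "10%"
            }
        }

    def filter_quality(self, code: str) -> bool:
        """Return True only if the snippet passes every hard quality check."""
        checks = [
            self._check_syntax,        # parses as valid Python
            self._check_length,        # neither trivial nor enormous
            self._check_complexity,    # no extremely long (minified) lines
            self._check_no_secrets,    # no obvious credentials
        ]
        # Having comments is an optional signal, so it is not a hard filter here
        return all(check(code) for check in checks)

    def _check_syntax(self, code: str) -> bool:
        """Syntax check via AST parsing (Python sources only)."""
        try:
            ast.parse(code)
            return True
        except (SyntaxError, ValueError):
            return False

    # The thresholds below are illustrative placeholders, not tuned values
    def _check_length(self, code: str) -> bool:
        return 10 <= len(code) <= 100_000

    def _check_complexity(self, code: str) -> bool:
        return max((len(line) for line in code.splitlines()), default=0) <= 1_000

    def _check_no_secrets(self, code: str) -> bool:
        pattern = r"(api[_-]?key|secret|password)\s*=\s*['\"]\w+"
        return re.search(pattern, code, re.IGNORECASE) is None

    def deduplicate(self, codes: list) -> list:
        """Exact deduplication by content hash.

        Near-duplicate removal (MinHash/LSH) and semantic deduplication
        are usually layered on top of this step.
        """
        seen, unique = set(), []
        for code in codes:
            digest = hashlib.sha256(code.strip().encode("utf-8")).hexdigest()
            if digest not in seen:
                seen.add(digest)
                unique.append(code)
        return unique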


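class CodeDataMixing:
    """
    Data-mixing strategies for code.
    The percentages are illustrative, not the published recipe of any model.
    """

    # Language mix for a general-purpose code model
    general_code_mix = {
        "Python": "30%",      # most popular, most data available
        "JavaScript/TypeScript": "20%",
        "Java": "15%",
        "C/C++": "10%",
        "Go": "5%",
        "Rust": "5%",
        "other": "15%"
    }

    # Mix for a code-specialized model (in the spirit of CodeLlama)
    code_optimized_mix = {
        "code": "70%",        # code share raised sharply
        "code + natural language": "20%",  # code explanations, documentation
        "general text": "10%"
    }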
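The near-duplicate step mentioned for deduplication (MinHash/LSH) can be sketched in pure Python. This is a minimal illustration under stated assumptions, not a production recipe: the function names, the 5-token shingle size, the 64 hash functions, and the 0.8 threshold are all hypothetical choices, and the pairwise comparison stands in for the LSH bucketing a real pipeline would use.

```python
import hashlib
import re


def shingles(code: str, k: int = 5) -> set:
    """Split code into overlapping k-token shingles (whitespace is ignored)."""
    tokens = re.findall(r"\w+|[^\w\s]", code)
    if len(tokens) < k:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def minhash_signature(items: set, num_perm: int = 64) -> list:
    """For each salted hash function, keep the minimum hash over all shingles."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in items
        ))
    return signature


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """The fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def near_deduplicate(codes: list, threshold: float = 0.8) -> list:
    """Keep a snippet only if it is not a near-duplicate of one already kept.

    Pairwise comparison is O(n^2); large pipelines bucket signatures with LSH.
    """
    kept, kept_sigs = [], []
    for code in codes:
        sig = minhash_signature(shingles(code))
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(code)
            kept_sigs.append(sig)
    return kept
```

Because shingles are built from tokens rather than characters, two files that differ only in whitespace produce identical signatures and collapse to one entry.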

1.3 Building Code SFT Data

Python

class CodeSFTDataBuilder:
    """
    Builds code SFT data.
    """

    def __init__(self):
        self.task_types = [
            "code_completion",     # code completion
            "code_generation",     # code generation
            "code_explanation",    # code explanation
            "code_debugging",      # debugging
            "code_refactoring",    # refactoring
            "code_translation",    # translation between languages
            "test_generation",     # test generation
            "doc_generation",      # documentation generation
        ]

    def build_instruction_data(self):
        """
        Build instruction-code pairs.
        """
        examples = {
            # Code generation
            "code_generation": {
                "instruction": "Implement a function that counts the working days between two dates",
                "input": "",
                "output": '''
def count_workdays(start_date, end_date):
    """
    Count the working days between two dates, both inclusive (weekends excluded).

    Args:
        start_date: start date (datetime.date)
        end_date: end date (datetime.date)

    Returns:
        int: number of working days
    """
    from datetime import timedelta

    workdays = 0
    current = start_date

    while current <= end_date:
        # 0=Monday, 6=Sunday
        if current.weekday() < 5:
            workdays += 1
        current += timedelta(days=1)

    return workdays
'''
            },

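            # Code explanation
            "code_explanation": {
                "instruction": "Explain what the following code does",
                "input": '''
def f(x):
    return x * f(x-1) if x > 1 else 1
''',
                "output": '''
This is a recursive implementation of the factorial function.

What it does:
1. Computes the factorial of x (x!)
2. Base case: returns 1 when x <= 1
3. Recursive case: x * f(x-1)

Example:
- f(5) = 5 * f(4) = 5 * 4 * f(3) = ... = 120

Note: a large x exceeds Python's recursion limit and raises RecursionError. Python performs no tail-call optimization, so prefer an iterative version or math.factorial.
'''
            },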

            # Code debugging
            "code_debugging": {
                "instruction": "Find and fix the bug in the following code",
                "input": '''
def find_max(lst):
    max_val = 0
    for item in lst:
        if item > max_val:
            max_val = item
    return max_val
''',
                "output": '''
Bug analysis:
When every value in the list is negative, the function wrongly returns 0 instead of the largest negative value. An empty list also returns 0.

Fix:
def find_max(lst):
    if not lst:
        return None  # handle the empty list
    max_val = lst[0]  # initialize with the first element
    for item in lst[1:]:
        if item > max_val:
            max_val = item
    return max_val

Or, more concisely:
def find_max(lst):
    return max(lst) if lst else None
'''
            }
        }

        return examples


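class SyntheticCodeData:
    """
    Synthetic code data generation:
    use a strong model to produce high-quality code training data.
    """

    def generate_code_problems(self, difficulty="medium"):
        """
        Generate programming problems automatically.
        """
        prompt = f"""
        Write a programming problem of {difficulty} difficulty that includes:
        1. Problem statement
        2. Input/output format
        3. Examples
        4. Constraints
        5. Reference solution (Python)
        """
        # Send the prompt to a strong model such as GPT-4 or Claude (client call omitted)
        pass

    def generate_with_tests(self, problem_description):
        """
        Generate code together with test cases.
        """
        # 1. Generate the code
        # 2. Generate test cases
        # 3. Run the tests against the code
        # 4. Keep only the samples that pass
        pass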
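The four steps of generate_with_tests end with "keep only the samples that pass". A minimal sketch of that filtering step follows, assuming each sample is a dict with hypothetical "solution" and "tests" keys, where "tests" holds plain assert statements. The helper names passes_tests and filter_verified are made up for this illustration, and running code in a child process is not a real sandbox.

```python
import os
import subprocess
import sys
import tempfile


def passes_tests(solution: str, tests: str, timeout: float = 5.0) -> bool:
    """Run solution + assert-based tests in a child process.

    Returns True only when the process exits with code 0 within the timeout.
    """
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "candidate.py")
        with open(path, "w", encoding="utf-8") as f:
            f.write(solution + "\n\n" + tests + "\n")
        try:
            result = subprocess.run(
                [sys.executable, path],
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False
        return result.returncode == 0


def filter_verified(samples: list) -> list:
    """Keep only the samples whose solution passes its own tests."""
    return [s for s in samples if passes_tests(s["solution"], s["tests"])]
```

A passing sample is only as trustworthy as its tests: when the same model writes both the solution and the tests, a shared misunderstanding can slip through, so some pipelines cross-check with independently generated tests.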

1.4 Reinforcement Learning for Code

Python

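class CodeRLTrainer:
    """
    Reinforcement-learning trainer for code,
    driven by execution feedback.
    """

    def __init__(self, model, test_executor):
        self.model = model
        self.test_executor = test_executor  # code executor

    def compute_reward(self, generated_code, test_cases):
        """
        Compute the reward from the test-case pass rate.
        """
        if not test_cases:
            return 0.0  # nothing to verify against; also avoids division by zero

        total_reward = 0.0

        for test_case in test_cases:
            try:
                # Run the generated code
                result = self.test_executor.execute(
                    generated_code,
                    test_case["input"]
                )

                # Compare outputs, ignoring surrounding whitespace such as a trailing newline
                if result.strip() == test_case["expected_output"].strip():
                    total_reward += 1.0

            except Exception:
                # The code crashed or timed out: negative reward
                total_reward -= 0.5

        # Normalize; the result lies in [-0.5, 1.0]
        return total_reward / len(test_cases)

    def train_step(self, problem, test_cases):
        """
        A single training step.
        """
        # 1. Generate code
        generated_code = self.model.generate(problem)

        # 2. Compute the reward
        reward = self.compute_reward(generated_code, test_cases)

        # 3. Policy-gradient update with an algorithm such as PPO or GRPO (omitted)
        return reward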


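class CodeExecutionEnvironment:
    """
    Execution environment for model-generated code.

    This version only isolates the code in a child process with a timeout.
    It is not a real sandbox: production systems add containers or
    syscall restrictions, and memory_limit is not enforced here.
    """

    def __init__(self, timeout=5, memory_limit="256M"):
        self.timeout = timeout
        self.memory_limit = memory_limit  # recorded only; enforcement is platform-specific

    def execute(self, code: str, input_data: str) -> str:
        """
        Run the code in a child process and return its stdout.
        """
        import subprocess
        import sys
        import tempfile

        with tempfile.NamedTemporaryFile(mode='w', suffix='.py') as f:
            f.write(code)
            f.flush()

            try:
                result = subprocess.run(
                    [sys.executable, f.name],  # same interpreter as the caller
                    input=input_data,
                    capture_output=True,
                    text=True,
                    timeout=self.timeout
                )
            except subprocess.TimeoutExpired:
                raise TimeoutError("code execution timed out")

            # A non-zero exit code means the program crashed; raise so that
            # the caller can apply its failure penalty
            if result.returncode != 0:
                raise RuntimeError(f"execution error: {result.stderr}")
            return result.stdout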

2. Specialized Training for Agent Capability

2.1 Overview of Agent Capability Training

Text Only

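┌──────────────────────────────────────────────────────────────────────┐
│                Overview of Agent Capability Training                 │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Core capability dimensions:                                         │
│                                                                      │
│  1. Tool calling                                                     │
│     ├── Function-call format                                         │
│     ├── Argument extraction and filling                              │
│     ├── Multi-tool coordination                                      │
│     └── Error handling and retries                                   │
│                                                                      │
│  2. Planning                                                         │
│     ├── Task decomposition                                           │
│     ├── Step ordering                                                │
│     ├── Dependency management                                        │
│     └── Dynamic replanning                                           │
│                                                                      │
│  3. Memory                                                           │
│     ├── Short-term memory (context window)                           │
│     ├── Long-term memory (vector store)                              │
│     ├── Working memory (state management)                            │
│     └── Memory retrieval and update                                  │
│                                                                      │
│  4. Reasoning                                                        │
│     ├── Logical reasoning                                            │
│     ├── Causal reasoning                                             │
│     ├── Reflection and self-correction                               │
│     └── Multi-path exploration                                       │
│                                                                      │
│  5. Interaction                                                      │
│     ├── Multi-turn dialogue                                          │
│     ├── Clarifying questions                                         │
│     ├── Incorporating feedback                                       │
│     └── Presenting results                                           │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘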

2.2 Building Agent Training Data

Python

class AgentTrainingDataBuilder:
    """
    Builds Agent training data.
    """

    def __init__(self):
        self.data_types = {
            "tool_calling": "tool-calling data",
            "planning": "planning data",
            "reasoning": "reasoning data",
            "multi_turn": "multi-turn dialogue data"
        }

    def build_tool_calling_data(self):
        """
        Build tool-calling training data.
        """
        examples = [
            {
                "messages": [
                    {
                        "role": "user",
                        "content": "What is the weather like in Beijing today?"
                    },
                    {
                        "role": "assistant",
                        "content": None,
                        "tool_calls": [
                            {
                                "id": "call_001",
                                "type": "function",
                                "function": {
                                    "name": "get_weather",
                                    "arguments": '{"city": "Beijing"}'
                                }
                            }
                        ]
                    },
                    {
                        "role": "tool",
                        "tool_call_id": "call_001",
                        "content": '{"temperature": 25, "weather": "sunny", "humidity": 60}'
                    },
                    {
                        "role": "assistant",
                        "content": "It is sunny in Beijing today, with a temperature of 25°C and 60% humidity."
                    }
                ]
            }
        ]
        return examples

    def build_planning_data(self):
        """
        Build task-planning training data.
        """
        examples = [
            {
                "task": "Book me a flight from Beijing to Shanghai for tomorrow, and a hotel in Shanghai",
                "plan": [
                    {
                        "step": 1,
                        "action": "search_flights",
                        "params": {"from": "Beijing", "to": "Shanghai", "date": "tomorrow"},
                        "reasoning": "Start by looking up available flights"
                    },
                    {
                        "step": 2,
                        "action": "book_flight",
                        "params": {"flight_id": "TBD"},
                        "reasoning": "Pick a flight that matches the user's preferences and book it",
                        "depends_on": [1]
                    },
                    {
                        "step": 3,
                        "action": "search_hotels",
                        "params": {"city": "Shanghai", "check_in": "tomorrow"},
                        "reasoning": "Look up hotels in Shanghai"
                    },
                    {
                        "step": 4,
                        "action": "book_hotel",
                        "params": {"hotel_id": "TBD"},
                        "reasoning": "Pick a hotel that matches the user's preferences and book it",
                        "depends_on": [3]
                    }
                ]
            }
        ]
        return examples

    def build_reasoning_data(self):
        """
        Build reasoning training data (ReAct style).
        """
        examples = [
            {
                "question": "Which is better for photography enthusiasts, the iPhone 15 Pro or the Samsung S24 Ultra?",
                "trajectory": [
                    {
                        "thought": "I need to compare the camera features of the two phones",
                        "action": "search",
                        "action_input": "iPhone 15 Pro camera specs"
                    },
                    {
                        "observation": "iPhone 15 Pro: 48MP main camera, 3x optical zoom, ProRAW support"
                    },
                    {
                        "thought": "Now look up the Samsung specs",
                        "action": "search",
                        "action_input": "Samsung S24 Ultra camera specs"
                    },
                    {
                        "observation": "S24 Ultra: 200MP main camera, 5x optical zoom, 8K video"
                    },
                    {
                        "thought": "Compare their strengths and the situations each one suits",
                        "action": "compare",
                        "action_input": "iPhone 15 Pro vs S24 Ultra photography"
                    },
                    {
                        "final_answer": "For photography enthusiasts:\n\n" +
                            "Choose the iPhone 15 Pro if you:\n" +
                            "- prefer natural, consistent colors\n" +
                            "- need ProRAW files for post-processing\n" +
                            "- care most about video quality\n\n" +
                            "Choose the S24 Ultra if you:\n" +
                            "- need very high resolution (200MP)\n" +
                            "- need longer zoom (5x optical)\n" +
                            "- shoot 8K video"
                    }
                ]
            }
        ]
        return examples


class AgentSFTDataFormats:
    """
    Data formats for Agent SFT.
    The `...` entries are placeholders to be filled in per sample.
    """

    # OpenAI Function Calling format
    openai_format = {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "user question"},
            {"role": "assistant", "tool_calls": [...]},
            {"role": "tool", "tool_call_id": "...", "content": "..."}
        ],
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "function_name",
                    "description": "what the function does",
                    "parameters": {
                        "type": "object",
                        "properties": {...},
                        "required": [...]
                    }
                }
            }
        ]
    }

    # ReAct format
    react_format = {
        "input": "user question",
        "trajectory": [
            {"thought": "reasoning", "action": "action name", "action_input": "action input"},
            {"observation": "result observed"},
            # ... more steps
            {"thought": "final reasoning", "action": "Finish", "action_input": "final answer"}
        ]
    }
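To train on the ReAct format above, the trajectory has to be flattened into text. The sketch below shows one possible layout, Thought / Action[input] / Observation lines, and marks which segments the model itself wrote. Masking environment-produced text (the question and the observations) out of the loss is a common practice rather than something this chapter prescribes, and react_segments is a hypothetical helper name.

```python
def react_segments(example: dict) -> list:
    """Flatten a ReAct trajectory into (text, train_on) pairs.

    train_on is False for text the model did not write (the question and
    tool observations), so a trainer can exclude those tokens from the loss.
    """
    segments = [(f"Question: {example['input']}\n", False)]
    for step in example["trajectory"]:
        if "observation" in step:
            segments.append((f"Observation: {step['observation']}\n", False))
        else:
            text = (
                f"Thought: {step['thought']}\n"
                f"Action: {step['action']}[{step['action_input']}]\n"
            )
            segments.append((text, True))
    return segments


example = {
    "input": "What is 2 + 3?",
    "trajectory": [
        {"thought": "I should use the calculator", "action": "calculator", "action_input": "2 + 3"},
        {"observation": "5"},
        {"thought": "I have the answer", "action": "Finish", "action_input": "5"},
    ],
}

for text, train_on in react_segments(example):
    print("TRAIN" if train_on else "MASK ", text, end="")
```

Keeping the flag next to each segment, instead of producing one flat string, lets the tokenizer step build the loss mask without re-parsing the text.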

2.3 Reinforcement Learning for Agents

Python

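class AgentRLTrainer:
    """
    Reinforcement-learning trainer for Agents.
    """

    def __init__(self, model, env, max_steps=20):
        self.model = model
        self.env = env  # Agent execution environment
        self.max_steps = max_steps  # hard cap so a stuck episode cannot loop forever

    def compute_agent_reward(self, trajectory, task_success):
        """
        Compute the reward for one Agent episode.

        Components:
        1. Task success
        2. Step efficiency
        3. Correct tool usage
        4. Correct format
        """
        reward = 0.0

        # 1. Task success (the dominant term)
        if task_success:
            reward += 10.0

        # 2. Step efficiency (encourages short trajectories), clamped to [0, 1]
        optimal_steps = max(1, self.env.get_optimal_steps())
        actual_steps = len(trajectory)
        efficiency = 1 - (actual_steps - optimal_steps) / optimal_steps
        reward += min(1.0, max(0.0, efficiency)) * 2.0

        # 3. Correct tool usage. The "tool_correct" flag has to be set by the
        #    environment or a labeling pass; steps without it count as incorrect
        correct_tool_usage = sum(1 for t in trajectory if t.get("tool_correct", False))
        reward += correct_tool_usage * 0.5

        # 4. Correct format
        format_correct = all(self._check_format(t) for t in trajectory)
        if format_correct:
            reward += 1.0

        return reward

    def _check_format(self, step):
        """Minimal format check: every step must record an action."""
        return step.get("action") is not None

    def update_policy(self, trajectory, total_reward):
        """Policy update (PPO, GRPO, ...); depends on the training framework."""
        raise NotImplementedError

    def train_with_environment(self, tasks):
        """
        Train the Agent inside the environment.
        """
        for task in tasks:
            # 1. The Agent executes the task
            trajectory = []
            state = self.env.reset(task)
            done = False
            info = {}

            while not done and len(trajectory) < self.max_steps:
                # The model decides on an action
                action = self.model.act(state, trajectory)

                # Execute the action
                next_state, reward, done, info = self.env.step(action)

                trajectory.append({
                    "state": state,
                    "action": action,
                    "reward": reward,
                    "next_state": next_state
                })

                state = next_state

            # 2. Compute the episode reward
            total_reward = self.compute_agent_reward(
                trajectory,
                info.get("success", False)
            )

            # 3. Update the policy
            self.update_policy(trajectory, total_reward)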

class AgentTrainingEnvironment:
    """
    Training environment for Agents:
    simulates realistic scenarios for the Agent to practice in.

    The tool implementations (_search_tool and so on) and
    _check_task_completion are task-specific and omitted here.
    """

    def __init__(self):
        self.tools = self._register_tools()
        self.current_task = None

    def _register_tools(self):
        """Register the available tools."""
        return {
            "search": self._search_tool,
            "calculator": self._calculator_tool,
            "weather": self._weather_tool,
            "database": self._database_tool,
            "api_call": self._api_call_tool
        }

    def reset(self, task):
        """Reset the environment and start a new task."""
        self.current_task = task
        return {"task": task, "available_tools": list(self.tools.keys())}

    def step(self, action):
        """Execute an action; return (state, reward, done, info)."""
        tool_name = action.get("tool")
        tool_input = action.get("input")

        if tool_name not in self.tools:
            return {
                "error": f"Unknown tool: {tool_name}"
            }, -1.0, False, {"success": False}

        try:
            result = self.tools[tool_name](tool_input)

            # Check whether the task is complete
            done = self._check_task_completion(result)

            return result, 0.1, done, {"success": done}

        except Exception as e:
            return {"error": str(e)}, -0.5, False, {"success": False}
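A worked example makes the relative size of the four reward terms in compute_agent_reward easier to see. The standalone function below reuses the weights from the trainer above (10.0, 2.0, 0.5, 1.0). Clamping the efficiency term to [0, 1] and guarding against a zero optimal step count are assumptions added for this sketch, and agent_reward is a hypothetical name.

```python
def agent_reward(num_steps, optimal_steps, task_success, correct_tool_calls, format_ok):
    """Standalone version of the episode reward, with efficiency clamped to [0, 1]."""
    reward = 10.0 if task_success else 0.0

    # Efficiency: 1.0 at or below the optimal step count, falling to 0.0 at twice that
    optimal_steps = max(1, optimal_steps)
    efficiency = 1 - (num_steps - optimal_steps) / optimal_steps
    reward += min(1.0, max(0.0, efficiency)) * 2.0

    reward += correct_tool_calls * 0.5
    reward += 1.0 if format_ok else 0.0
    return reward


# Success in 6 steps (optimal 4), 5 correct tool calls, valid format:
# 10.0 + (1 - 2/4) * 2.0 + 5 * 0.5 + 1.0 = 14.5
print(agent_reward(6, 4, True, 5, True))

# Failure after 2 steps, 2 correct tool calls, valid format:
# 0.0 + 1.0 * 2.0 + 2 * 0.5 + 1.0 = 4.0
print(agent_reward(2, 4, False, 2, True))
```

The second case shows why the success term should dominate: a run that gives up early still collects the efficiency, tool, and format bonuses, so shaping terms that are too large relative to the success reward can teach the policy to fail quickly and tidily.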

3. Tool-Calling Capability Training

3.1 Training Approach for Tool Calling

Text Only

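┌──────────────────────────────────────────────────────────────────────┐
│                   Tool-Calling Capability Training                   │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Stage 1: Learning the tool-call format                              │
│  ├── Learn the JSON Schema format                                    │
│  ├── Learn parameter types (string, number, boolean, array, object)  │
│  └── Learn required vs. optional parameters                          │
│                                                                      │
│  Stage 2: Single-tool calls                                          │
│  ├── Simple argument extraction                                      │
│  ├── Extracting arguments from natural language                      │
│  └── Handling defaults and optional parameters                       │
│                                                                      │
│  Stage 3: Choosing among multiple tools                              │
│  ├── Pick the right tool for the user's intent                       │
│  ├── Cope with vague tool descriptions                               │
│  └── Learn how similar tools differ                                  │
│                                                                      │
│  Stage 4: Coordinating multiple tools                                │
│  ├── Sequential calls (with dependencies)                            │
│  ├── Parallel calls (no dependencies)                                │
│  └── Conditional calls (decided by earlier results)                  │
│                                                                      │
│  Stage 5: Error handling                                             │
│  ├── Correcting invalid arguments                                    │
│  ├── Retrying after a failed tool call                               │
│  └── Handling anomalous results                                      │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘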

3.2 Building Tool-Calling Data

Python

class ToolCallingDataBuilder:
    """
    Builds tool-calling training data.
    """

    def __init__(self):
        self.tool_schemas = self._define_tools()

    def _define_tools(self):
        """Define the tool schemas."""
        return [
            {
                "name": "get_weather",
                "description": "Get the weather for a given city",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "city": {
                            "type": "string",
                            "description": "City name, e.g. 'Beijing'"
                        },
                        "unit": {
                            "type": "string",
                            "enum": ["celsius", "fahrenheit"],
                            "description": "Temperature unit"
                        }
                    },
                    "required": ["city"]
                }
            },
            {
                "name": "search_web",
                "description": "Search the web for information",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "query": {
                            "type": "string",
                            "description": "Search keywords"
                        },
                        "num_results": {
                            "type": "integer",
                            "description": "Number of results to return",
                            "default": 5
                        }
                    },
                    "required": ["query"]
                }
            }
        ]

    def generate_training_samples(self):
        """Generate training samples."""
        samples = []

        # Single-tool call samples
        samples.extend(self._single_tool_samples())

        # Tool-selection samples
        samples.extend(self._tool_selection_samples())

        # Multi-tool coordination samples
        samples.extend(self._multi_tool_samples())

        return samples

    def _single_tool_samples(self):
        """Single-tool call samples."""
        return [
            {
                "user_input": "What is the weather like in Beijing today?",
                "expected_call": {
                    "name": "get_weather",
                    "arguments": {"city": "Beijing"}
                }
            },
            {
                "user_input": "Search for Python tutorials for me",
                "expected_call": {
                    "name": "search_web",
                    "arguments": {"query": "Python tutorial"}
                }
            }
        ]

    def _tool_selection_samples(self):
        """Tool-selection samples."""
        return [
            {
                "user_input": "Will it rain in Shanghai tomorrow?",
                "available_tools": ["get_weather", "search_web", "calculator"],
                "expected_tool": "get_weather",
                "reasoning": "A weather question should go to the weather tool"
            },
            {
                "user_input": "Where will the 2026 World Cup be held?",
                "available_tools": ["get_weather", "search_web", "calculator"],
                "expected_tool": "search_web",
                "reasoning": "Questions about current facts need a web search"
            }
        ]

    def _multi_tool_samples(self):
        """Multi-tool coordination samples."""
        return [
            {
                "user_input": "Compare today's weather in Beijing and Shanghai",
                "expected_calls": [
                    {"name": "get_weather", "arguments": {"city": "Beijing"}},
                    {"name": "get_weather", "arguments": {"city": "Shanghai"}}
                ],
                "execution_order": "parallel"  # the calls are independent
            },
            {
                # get_stock_price and calculate are not defined in _define_tools;
                # "stock_price" is a placeholder for the first call's result
                "user_input": "Look up Apple's share price and work out how many shares 10,000 would buy",
                "expected_calls": [
                    {"name": "get_stock_price", "arguments": {"symbol": "AAPL"}},
                    {"name": "calculate", "arguments": {"expression": "10000 / stock_price"}}
                ],
                "execution_order": "sequential"  # the second call needs the first result
            }
        ]


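class SyntheticToolData:
    """
    Synthetic tool-calling data.
    """

    def generate_from_tool_schema(self, tool_schema, num_samples=100):
        """
        Generate training data automatically from a tool schema.
        """
        samples = []

        for _ in range(num_samples):
            # 1. Generate argument values
            arguments = self._generate_arguments(tool_schema["parameters"])

            # 2. Generate a matching natural-language request
            user_input = self._generate_user_input(tool_schema, arguments)

            samples.append({
                "user_input": user_input,
                "expected_call": {
                    "name": tool_schema["name"],
                    "arguments": arguments
                }
            })

        return samples

    def _generate_arguments(self, parameters_schema):
        """Generate arguments that conform to the schema."""
        import random

        arguments = {}
        required = set(parameters_schema.get("required", []))
        for prop, spec in parameters_schema["properties"].items():
            # Always fill required parameters; include optional ones half the time
            if prop not in required and random.random() < 0.5:
                continue
            if spec["type"] == "string":
                if "enum" in spec:
                    arguments[prop] = random.choice(spec["enum"])
                else:
                    arguments[prop] = self._generate_string(prop)
            elif spec["type"] == "integer":
                arguments[prop] = random.randint(1, 100)
            elif spec["type"] == "number":
                arguments[prop] = round(random.uniform(0, 100), 2)
            elif spec["type"] == "boolean":
                arguments[prop] = random.choice([True, False])
            # array and object parameters are not handled in this sketch

        return arguments

    def _generate_string(self, prop):
        """Placeholder value; a real pipeline samples from a domain vocabulary or an LLM."""
        return f"sample_{prop}"

    def _generate_user_input(self, tool_schema, arguments):
        """Template-based request; a strong model usually paraphrases it for diversity."""
        args_text = ", ".join(f"{k}={v}" for k, v in arguments.items())
        return f"Please call {tool_schema['name']} with {args_text}"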
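Both synthetic samples and model outputs need checking against the tool schema before they are used for training or scored during error-handling practice. The validator below is a hand-written sketch that covers only a small subset of JSON Schema (required, type, enum) for flat parameter objects. validate_tool_call and its error strings are hypothetical; a full JSON Schema validator library would be the more robust choice in practice.

```python
import json

# Mapping from JSON Schema type names to Python types
_JSON_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "array": list,
    "object": dict,
}


def validate_tool_call(call: dict, tool_schemas: list) -> list:
    """Return a list of problems; an empty list means the call is valid."""
    schema = next((t for t in tool_schemas if t["name"] == call.get("name")), None)
    if schema is None:
        return [f"unknown tool: {call.get('name')}"]

    arguments = call.get("arguments", {})
    if isinstance(arguments, str):  # OpenAI-style calls carry arguments as a JSON string
        try:
            arguments = json.loads(arguments)
        except json.JSONDecodeError as exc:
            return [f"arguments are not valid JSON: {exc.msg}"]
    if not isinstance(arguments, dict):
        return ["arguments must be a JSON object"]

    params = schema["parameters"]
    errors = [
        f"missing required argument: {name}"
        for name in params.get("required", [])
        if name not in arguments
    ]
    for name, value in arguments.items():
        spec = params["properties"].get(name)
        if spec is None:
            errors.append(f"unexpected argument: {name}")
            continue
        # bool is a subclass of int in Python, so reject it explicitly for numeric types
        if isinstance(value, bool) and spec["type"] in ("integer", "number"):
            errors.append(f"{name}: expected {spec['type']}, got boolean")
        elif not isinstance(value, _JSON_TYPES[spec["type"]]):
            errors.append(f"{name}: expected {spec['type']}")
        elif "enum" in spec and value not in spec["enum"]:
            errors.append(f"{name}: {value!r} not in {spec['enum']}")
    return errors
```

Returning every problem at once, rather than stopping at the first, gives a retry prompt or a reward function more to work with: the error list can be fed back to the model as the tool response, which is one way to build the "correct invalid arguments" samples from Stage 5.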

4. Reinforcing Reasoning Capability

4.1 Training Methods for Reasoning

Text Only

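┌──────────────────────────────────────────────────────────────────────┐
│                   Reinforcing Reasoning Capability                   │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Method 1: Chain-of-thought distillation (CoT Distillation)          │
│  ├── A strong model generates chains of thought                      │
│  ├── The student model learns the chain-of-thought format            │
│  └── Fits settings where a strong model's API is available           │
│                                                                      │
│  Method 2: Self-generate + verify                                    │
│  ├── The model samples several reasoning paths                       │
│  ├── A verifier (code execution, math check) keeps correct ones      │
│  └── The model is then trained on the correct paths                  │
│                                                                      │
│  Method 3: RL for reasoning (DeepSeek-R1 style)                      │
│  ├── Reasoning problems generated automatically at scale             │
│  ├── Model attempts → automatic verification → reward signal         │
│  ├── Policy optimization with GRPO/PPO                               │
│  └── Emergent abilities: reflection, backtracking, multi-path search │
│                                                                      │
│  Method 4: Process supervision                                       │
│  ├── Supervises more than the final answer                           │
│  ├── Assigns a reward to every reasoning step                        │
│  └── Finer-grained signal, but annotation is expensive               │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘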
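Method 2 (self-generate + verify) amounts to rejection sampling: draw several reasoning traces, keep the ones whose final answer checks out, and reuse them as SFT data. A minimal sketch follows, in which the sampler is a stand-in for a real model call. The names rejection_sample, fraction_equal, and make_toy_sampler, as well as the dict keys, are assumptions made for illustration.

```python
from fractions import Fraction


def fraction_equal(predicted: str, truth: str) -> bool:
    """Exact numeric comparison, so '24/7' and '48/14' count as the same answer."""
    try:
        return Fraction(predicted) == Fraction(truth)
    except (ValueError, ZeroDivisionError):
        return False


def rejection_sample(problem: dict, sampler, verifier, num_samples: int = 8) -> list:
    """Sample reasoning traces and keep those whose final answer verifies.

    `sampler(question)` must return a (trace, answer) pair.
    """
    kept = []
    for _ in range(num_samples):
        trace, answer = sampler(problem["question"])
        if verifier(answer, problem["answer"]):
            kept.append({"prompt": problem["question"], "completion": trace})
    return kept


def make_toy_sampler(outputs):
    """Stand-in for a model: replays a fixed list of (trace, answer) pairs."""
    iterator = iter(outputs)
    return lambda question: next(iterator)


problem = {
    "question": "Pipes fill a pool at 1/6 + 1/4 - 1/8 of its volume per hour. How many hours to fill it?",
    "answer": "24/7",
}
outputs = [
    ("rate = 7/24 per hour, so t = 24/7", "24/7"),
    ("ignoring the drain: t = 12/5", "12/5"),
    ("t = 48/14", "48/14"),
]
kept = rejection_sample(problem, make_toy_sampler(outputs), fraction_equal, num_samples=3)
print([sample["completion"] for sample in kept])
```

The verifier only checks the final answer, so a trace that reaches the right number through faulty reasoning is still kept. That gap is what Method 4 (process supervision) is meant to close.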

4.2 Reasoning Training Code Example

Python

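class ReasoningRLTrainer:
    """
    Reinforcement-learning trainer for reasoning,
    loosely following the DeepSeek-R1 approach.

    Helpers such as _generate_code_problem, _extract_answer and
    _update_policy are left to the concrete implementation.
    """

    def __init__(self, model, verifier):
        self.model = model
        self.verifier = verifier  # answer verifier

    def generate_reasoning_problem(self, domain="math"):
        """
        Generate a reasoning problem automatically.
        """
        if domain == "math":
            return self._generate_math_problem()
        elif domain == "code":
            return self._generate_code_problem()
        elif domain == "logic":
            return self._generate_logic_problem()
        raise ValueError(f"unsupported domain: {domain}")

    def _generate_math_problem(self):
        """Generate a math problem."""
        # Templates or a strong model can produce these
        return {
            "problem": "A pool has two inlet pipes and one outlet pipe. Pipe A alone fills the pool in 6 hours, pipe B alone fills it in 4 hours, and outlet pipe C alone drains it in 8 hours. With all three open, how many hours does it take to fill the pool?",
            # Net fill rate: 1/6 + 1/4 - 1/8 = 7/24 of the pool per hour
            "answer": "24/7 hours",
            "verification_type": "math"
        }

    def train_with_rl(self, problems, num_epochs=1000):
        """
        Reinforcement-learning training loop.
        """
        for epoch in range(num_epochs):
            for problem in problems:
                # 1. The model generates a reasoning trace
                reasoning_trace = self.model.generate_with_reasoning(
                    problem["problem"]
                )

                # 2. Extract the final answer
                predicted_answer = self._extract_answer(reasoning_trace)

                # 3. Verify the answer with the verifier the problem asks for
                is_correct = self.verifier.verify(
                    predicted_answer,
                    problem["answer"],
                    problem.get("verification_type", "exact")
                )

                # 4. Compute the reward
                reward = 1.0 if is_correct else 0.0

                # 5. Bonus for reasoning-step quality. Keyword heuristics like
                #    this are easy to game, so the weight stays small
                step_quality = self._evaluate_reasoning_quality(reasoning_trace)
                reward += step_quality * 0.2

                # 6. Policy update (GRPO/PPO)
                self._update_policy(reasoning_trace, reward)

    def _evaluate_reasoning_quality(self, reasoning_trace):
        """
        Score the quality of the reasoning steps with keyword heuristics.
        """
        trace = reasoning_trace.lower()
        scores = []

        # Are there clearly marked steps?
        if "step" in trace or "first" in trace:
            scores.append(0.3)

        # Is there verification or reflection?
        if "verify" in trace or "check" in trace:
            scores.append(0.3)

        # Is there any error correction?
        if "correction" in trace or "recompute" in trace:
            scores.append(0.4)

        return sum(scores)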


class ReasoningVerifier:
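    """
    Verifier for reasoning answers
    """

    def verify(self, predicted, ground_truth, verification_type="exact"):
        """
        Check whether the predicted answer is correct
        """
        if verification_type == "exact":
            return self._exact_match(predicted, ground_truth)
        elif verification_type == "math":
            return self._math_verify(predicted, ground_truth)
        elif verification_type == "code":
            return self._code_verify(predicted, ground_truth)
        # Fail loudly instead of silently returning None for an unknown type
        raise ValueError(f"Unknown verification_type: {verification_type!r}")

    def _exact_match(self, predicted, ground_truth):
        """Exact match (ignores case and surrounding whitespace)"""
        return predicted.strip().lower() == ground_truth.strip().lower()

    def _math_verify(self, predicted, ground_truth):
        """Math check: compare numeric values so equivalent forms can match"""
        try:
            # Extract the numeric value from each answer
            pred_val = self._extract_math_value(predicted)
            true_val = self._extract_math_value(ground_truth)
            return abs(pred_val - true_val) < 1e-6
        except Exception:
            # Fall back to string comparison when no value can be extracted
            return self._exact_match(predicted, ground_truth)

    def _code_verify(self, predicted, ground_truth):
        """Code check (run the tests)"""
        # Placeholder: run the code and compare its output with the expected output
        raise NotImplementedError("code verification is left unimplemented here")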
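
`_math_verify` above calls an `_extract_math_value` helper that the chapter does not show. The following is a minimal sketch of what such a helper might look like, assuming answers end in a plain number, a simple fraction such as `3/4`, or a percentage. The function name `extract_math_value` and the regular expression are illustrative assumptions, not the chapter's implementation.

```python
import re
from fractions import Fraction

# Hypothetical helper: the chapter calls _extract_math_value but does not define it.
# Matches integers, decimals, simple fractions ("3/4") and percentages ("75%").
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?(?:\s*/\s*-?\d+(?:\.\d+)?)?%?")


def extract_math_value(text: str) -> float:
    """Return the last number found in `text` as a float.

    Raises ValueError when no number is found, so the caller can fall back
    to exact string matching.
    """
    cleaned = text.replace(",", "")  # drop thousands separators such as "1,000"
    matches = _NUMBER.findall(cleaned)
    if not matches:
        raise ValueError(f"no numeric value in {text!r}")
    token = matches[-1]
    is_percent = token.endswith("%")
    token = token.rstrip("%")
    if "/" in token:
        numerator, denominator = (part.strip() for part in token.split("/"))
        value = float(Fraction(numerator) / Fraction(denominator))
    else:
        value = float(token)
    return value / 100 if is_percent else value
```

With this helper, `"3/4"`, `"0.75"` and `"75%"` all map to the same value, which is what lets the numeric comparison accept equivalent forms. Stripping commas is a simplification that would mangle tuple answers such as `(1,2)`; symbolic answers would need a computer algebra system instead.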

5. Building Training Data¶

5.1 Comparing Data Construction Strategies¶

Text Only

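┌──────────────────────────────────────────────────────────────────────────────────────────┐
│                          Data Construction Strategies Compared                           │
├──────────────────────────────────────────────────────────────────────────────────────────┤
│                                                                                          │
│  Strategy                  Quality Scale  Cost   Diversity    Best suited for            │
│  ──────────────────────────────────────────────────────────────────────────────────────  │
│  Human annotation          Highest Low    High   Medium       High-quality SFT data      │
│  Strong-model distillation High    Medium Medium High         General capability gains   │
│  Auto-generate + verify    Medium  High   Low    High         Large-scale RL training    │
│  Real user data            Medium  Medium Low    High         Product iteration          │
│  Synthetic data            Varies  High   Low    Controllable Specific domains           │
│                                                                                          │
│  Recommended mix:                                                                        │
│  • SFT: human annotation (10%) + strong-model distillation (60%) + real user data (30%)  │
│  • RL: mostly auto-generate + verify; human annotation reserved for the validation set   │
│                                                                                          │
└──────────────────────────────────────────────────────────────────────────────────────────┘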
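
The recommended SFT mix can be turned into a sampling plan. The sketch below is one way to do it, assuming each source is already available as an in-memory list; the function name `build_sft_mix` and the toy pool names are hypothetical.

```python
import random


def build_sft_mix(pools, ratios, total, seed=0):
    """Draw `total` samples from named pools according to `ratios`.

    pools:  dict mapping source name -> list of samples
    ratios: dict mapping source name -> fraction of the final mix (must sum to 1.0)
    Quotas are rounded per source, so awkward ratios may miss `total` by a sample or two.
    """
    if abs(sum(ratios.values()) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1.0")
    rng = random.Random(seed)  # fixed seed keeps the mix reproducible
    mixed = []
    for name, ratio in ratios.items():
        quota = round(total * ratio)
        pool = pools[name]
        if quota > len(pool):
            raise ValueError(f"pool {name!r} has only {len(pool)} samples, need {quota}")
        mixed.extend(rng.sample(pool, quota))  # sample without replacement
    rng.shuffle(mixed)  # avoid training on one source at a time
    return mixed


# Toy pools; the source names and sizes are illustrative only.
pools = {
    "human": [f"human-{i}" for i in range(50)],
    "distilled": [f"distilled-{i}" for i in range(200)],
    "real": [f"real-{i}" for i in range(100)],
}
ratios = {"human": 0.10, "distilled": 0.60, "real": 0.30}
mix = build_sft_mix(pools, ratios, total=100)
```

Raising an error when a pool is too small is a deliberate choice here: silently oversampling a scarce source (typically the human-annotated one) would change the effective ratio without anyone noticing.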

5.2 Data Quality Control¶

Python

class DataQualityController:
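    """
    Quality control for training data
    """

    def __init__(self):
        self.quality_checks = [
            self._check_format,
            self._check_length,
            self._check_coherence,
            self._check_safety,
            self._check_diversity
        ]

    def filter_samples(self, samples):
        """Filter out low-quality samples"""
        filtered = []
        for sample in samples:
            if all(check(sample) for check in self.quality_checks):
                filtered.append(sample)
        return filtered

    def _check_format(self, sample):
        """Check that the format is valid"""
        # Required fields are present
        # JSON is well-formed
        return True  # placeholder: accepts everything

    def _check_length(self, sample):
        """Check that lengths are reasonable"""
        # The input must not be too short
        # The output must be neither too short nor too long
        return True  # placeholder: accepts everything

    def _check_coherence(self, sample):
        """Check coherence"""
        # The input and output are related
        # The reasoning process is consistent
        return True  # placeholder: accepts everything

    def _check_safety(self, sample):
        """Safety check"""
        # No harmful content
        # No sensitive information
        return True  # placeholder: accepts everything

    def _check_diversity(self, sample):
        """Diversity check"""
        # Avoid duplicate samples
        return True  # placeholder: accepts everything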
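
The checks above are placeholders that accept every sample. As a rough illustration of how three of them could be filled in, the sketch below implements a format check, a length check, and exact-duplicate removal. It assumes samples are dicts with string `"input"` and `"output"` fields; the class name `SimpleQualityFilter` and all thresholds are hypothetical.

```python
import hashlib


class SimpleQualityFilter:
    """Hypothetical filter: required fields, length bounds, exact-duplicate removal."""

    def __init__(self, min_input_chars=5, min_output_chars=10, max_output_chars=8000):
        self.min_input_chars = min_input_chars
        self.min_output_chars = min_output_chars
        self.max_output_chars = max_output_chars
        self._seen = set()  # fingerprints of samples already accepted

    def check_format(self, sample):
        # Required fields must exist and be strings
        return all(isinstance(sample.get(key), str) for key in ("input", "output"))

    def check_length(self, sample):
        input_len = len(sample["input"].strip())
        output_len = len(sample["output"].strip())
        return (
            input_len >= self.min_input_chars
            and self.min_output_chars <= output_len <= self.max_output_chars
        )

    def check_not_duplicate(self, sample):
        # Normalize case and whitespace so trivially different copies collide
        text = " ".join((sample["input"] + "\n" + sample["output"]).lower().split())
        fingerprint = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if fingerprint in self._seen:
            return False
        self._seen.add(fingerprint)
        return True

    def filter_samples(self, samples):
        kept = []
        for sample in samples:
            # The duplicate check records a fingerprint, so it runs last
            if (
                self.check_format(sample)
                and self.check_length(sample)
                and self.check_not_duplicate(sample)
            ):
                kept.append(sample)
        return kept


samples = [
    {"input": "Reverse a string in Python", "output": "def rev(s):\n    return s[::-1]"},
    {"input": "reverse a string  in python", "output": "def rev(s):\n    return s[::-1]"},
    {"input": "Hi", "output": "ok"},       # too short
    {"input": "Missing output field"},     # malformed
]
kept = SimpleQualityFilter().filter_samples(samples)
```

Unlike the per-sample checks in the class above, duplicate detection needs state shared across samples, which is why the fingerprint set lives on the filter object. Near-duplicate detection (for example MinHash) would be a natural next step but is out of scope for this sketch.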

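6. Frequently Asked Interview Questions¶

Q1: How do you specifically improve an LLM's coding ability?¶

Answer: Coding ability is built up over several stages:

Pretraining: raise the share of code in the data mix (15-30%) and cover many programming languages

SFT: high-quality instruction-code pairs spanning diverse tasks (completion, explanation, debugging)

RL: reinforcement learning on execution feedback, with test cases as the reward signal

Data quality: filter for code quality, deduplicate, and supplement with synthetic data

Execution feedback is the key: the most objective measure of coding ability is whether the code passes its test cases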
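
To make "test cases as the reward signal" concrete, here is a minimal sketch of an execution-feedback reward: the fraction of test snippets a candidate program passes. The function name `run_tests` and the test format (assertion snippets appended to the candidate) are assumptions made for illustration.

```python
import subprocess
import sys


def run_tests(candidate_code, test_cases, timeout=5.0):
    """Return the fraction of test cases that the candidate code passes.

    Each test case is a Python snippet of assertions that is appended to the
    candidate code and run in a fresh interpreter process.
    NOTE: a subprocess with a timeout is NOT a security sandbox; real training
    pipelines isolate untrusted code far more strictly.
    """
    if not test_cases:
        return 0.0
    passed = 0
    for test in test_cases:
        program = candidate_code + "\n\n" + test
        try:
            result = subprocess.run(
                [sys.executable, "-c", program],
                capture_output=True,
                timeout=timeout,
            )
            if result.returncode == 0:  # no exception and no failed assertion
                passed += 1
        except subprocess.TimeoutExpired:
            pass  # infinite loops and very slow code count as failures
    return passed / len(test_cases)


good = "def add(a, b):\n    return a + b"
buggy = "def add(a, b):\n    return a - b"
tests = ["assert add(1, 2) == 3", "assert add(0, 0) == 0", "assert add(-1, 1) == 0"]

reward_good = run_tests(good, tests)    # passes all three tests
reward_buggy = run_tests(buggy, tests)  # passes only the add(0, 0) test
```

A fractional reward gives partial credit; an all-or-nothing reward (1.0 only when every test passes) is the stricter alternative and is harder to game with partially correct code.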

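Q2: How is an Agent's tool-calling ability trained?¶

Answer: Tool-calling training proceeds in stages:

Format learning: learn JSON Schema and parameter types

Single-tool calls: extract arguments from natural language

Multi-tool selection: pick the right tool for the user's intent

Multi-tool coordination: sequential and parallel calls, with dependencies handled

Error handling: fix bad arguments and retry after failures

Data sources: human annotation + strong-model distillation + synthetic generation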
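
The format-learning and error-handling stages both depend on being able to check a generated call against the tool's schema. The sketch below shows one simplified way to do that without external libraries. The `get_weather` tool, the call shape (`name` plus `arguments`), and the function `validate_tool_call` are all hypothetical, and only a small subset of JSON Schema is handled.

```python
import json

# Hypothetical tool specification in a JSON-Schema-like shape
WEATHER_TOOL = {
    "name": "get_weather",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["city"],
    },
}

# Only a few primitive types are mapped; anything else is accepted as-is
_PY_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}


def validate_tool_call(raw_call, tool):
    """Return a list of error messages; an empty list means the call is valid."""
    try:
        call = json.loads(raw_call)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(call, dict):
        return ["tool call must be a JSON object"]

    errors = []
    if call.get("name") != tool["name"]:
        errors.append(f"unknown tool: {call.get('name')!r}")
    args = call.get("arguments", {})
    if not isinstance(args, dict):
        return errors + ["arguments must be a JSON object"]

    schema = tool["parameters"]
    for key in schema.get("required", []):
        if key not in args:
            errors.append(f"missing required argument: {key}")
    for key, value in args.items():
        spec = schema["properties"].get(key)
        if spec is None:
            errors.append(f"unexpected argument: {key}")
            continue
        expected = _PY_TYPES.get(spec.get("type"), object)
        if not isinstance(value, expected):
            errors.append(f"argument {key} should be of type {spec['type']}")
        elif "enum" in spec and value not in spec["enum"]:
            errors.append(f"argument {key} must be one of {spec['enum']}")
    return errors


ok = validate_tool_call(
    '{"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}',
    WEATHER_TOOL,
)
bad = validate_tool_call(
    '{"name": "get_weather", "arguments": {"unit": "kelvin"}}',
    WEATHER_TOOL,
)
```

The same error list can serve two purposes: as a filter when building SFT data, and as feedback text returned to the model so that it can correct its arguments and retry.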

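Q3: How did DeepSeek-R1 train its reasoning ability?¶

Answer: DeepSeek-R1's training recipe:

Cold-start SFT: a small amount of high-quality reasoning-chain (CoT) data

Large-scale RL: the GRPO algorithm in a loop of problem with a checkable answer → model attempt → rule-based verification → reward

Rejection sampling + SFT: collect high-quality outputs from the RL-trained model and retrain on them

All-scenario RL: add safety and helpfulness rewards

The key is that it does not depend on large-scale human-annotated reasoning chains; it relies on a closed loop of generation and automatic verification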
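
GRPO scores each sampled answer relative to the other answers drawn for the same prompt, instead of relying on a learned value function. The core of that idea, the group-relative advantage, can be sketched as follows. The function name, the `eps` term, and the use of the population standard deviation are assumptions of this sketch.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of one group of answers sampled for the same prompt.

    advantage_i = (reward_i - mean(rewards)) / (std(rewards) + eps)
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to one problem: two verified correct, two incorrect
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# If every answer in the group gets the same reward, there is no learning signal
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Correct answers receive positive advantages and incorrect ones negative advantages. When a whole group is uniformly right or wrong, all advantages are zero, which is one reason problem difficulty matters for this style of training.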

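Q4: What role does synthetic data play in specialized training?¶

Answer: Synthetic data serves several purposes:

Scale: it can be generated in massive volumes

Controllability: data of a specific type and difficulty can be generated on demand

Gap filling: it covers scenarios that are hard for humans to annotate

Lower cost: roughly 100x or more cheaper than human annotation (an order-of-magnitude estimate)

Risks: model collapse and bias amplification

Mitigation: mix synthetic with real data and apply strict quality filtering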

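Q5: How do you evaluate the effect of specialized training?¶

Answer: Common benchmarks by capability:

Coding ability:
- HumanEval, MBPP (function-level code generation)
- CodeContests (competitive programming problems)
- SWE-bench (real GitHub issues)

Agent ability:
- ToolBench (tool calling)
- AgentBench (multi-task Agent)
- WebShop (web interaction)

Reasoning ability:
- MATH, GSM8K (mathematics)
- GPQA (scientific reasoning)
- BBH (general reasoning)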

Chapter Summary¶

Text Only

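┌──────────────────────────────────────────────────────────────────────────────────────────┐
│                                      Key Takeaways                                       │
├──────────────────────────────────────────────────────────────────────────────────────────┤
│                                                                                          │
│  1. Code capability training:                                                            │
│     Pretraining (15-30% code) → SFT (multi-task) → RL (execution feedback)               │
│     Key: test cases as an objective reward signal                                        │
│                                                                                          │
│  2. Agent capability training:                                                           │
│     Tool calling → planning → memory → reasoning → interaction                           │
│     Key: multi-tool coordination and error handling                                      │
│                                                                                          │
│  3. Tool-calling training:                                                               │
│     Format learning → single tool → multi-tool selection → multi-tool coordination       │
│     Key: accurate argument extraction from natural language                              │
│                                                                                          │
│  4. Reasoning training:                                                                  │
│     CoT distillation → self-generation + verification → reinforcement learning           │
│     Key: a closed loop of automatic verification and reward                              │
│                                                                                          │
│  5. Data strategy:                                                                       │
│     Human annotation (quality) + distillation (scale) + synthetic data (coverage)        │
│     Key: quality filtering and guaranteed diversity                                      │
│                                                                                          │
│  6. Latest trends as of 2026 (the Agentic Coding era):                                   │
│     • GPT-5.3-Codex: flagship coding model, leading on SWE-Bench                         │
│     • GPT-5.4: native computer use, automated workflows                                  │
│     • Claude Opus 4.6: top-tier code review and optimization                             │
│     • Qwen 3.5-Coder: agentic coding usable on-device                                    │
│     • MCP: a standardized tool-calling ecosystem                                         │
│                                                                                          │
└──────────────────────────────────────────────────────────────────────────────────────────┘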

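📝 Chapter Exercises¶

🤔 Discussion Questions¶

Code training data: What quality standards apply to code pretraining data? How should sensitive information in code (API keys, passwords, etc.) be handled?
SWE-Bench: How does SWE-Bench evaluate a code Agent? Why does it reflect real-world programming ability better than HumanEval?
Agent training: How can RL be used to train a code Agent? What makes reward-signal design challenging?
MCP and code: How does MCP change the way a code Agent calls tools? What advantages does it have over hard-coded API calls?

💻 Coding Practice¶

Beginner: Evaluate a code model's Pass@1 and Pass@10 on the HumanEval dataset
Intermediate: Implement a simple code Agent that can read files, execute code, and fix bugs based on error messages
Advanced: Train a small code model with GRPO and a unit-test reward, then compare HumanEval scores before and after training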
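
For the beginner exercise, Pass@k is usually computed with the unbiased estimator introduced alongside HumanEval: generate n samples per problem, count the c samples that pass all tests, and estimate pass@k = 1 - C(n-c, k) / C(n, k). A minimal sketch follows; the function names are illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for a single problem.

    n: number of samples generated
    c: number of samples that passed all tests
    k: sampling budget being evaluated (k <= n)
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k):
    """Average pass@k over problems; `results` is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For k = 1 the estimator reduces to c / n, the plain success rate. Generating more samples than k (for example n = 20 when reporting Pass@10) lowers the variance of the estimate.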

💡 Reference Answers

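#### Reference Answers to the Discussion Questions

**1. Code training data quality**

Quality standards:
- **Executability**: the code passes syntax checks and compiles
- **Completeness**: it includes full context (imports, dependencies)
- **Diversity**: it covers many languages, frameworks, and task types
- **Documentation**: it has comments and docstrings

Handling sensitive information: use regex matching plus a secret scanner to detect API keys, passwords, and tokens, then replace them with placeholders or remove them.

**2. SWE-Bench**

How it works: given a GitHub issue and a snapshot of the code repository, the Agent must generate a patch that fixes the bug. The fix is verified by running the test suite.

Why it is more realistic than HumanEval:
- Real GitHub issues (not written for the benchmark)
- Requires understanding the context of an entire repository
- May involve changes across multiple files
- Test cases come from the project itself (not simple input/output pairs)

**3. Agent RL training**

Challenges in reward-signal design:
- **Sparse rewards**: reward arrives only when the task is completed, with no signal for intermediate steps
- **Delayed rewards**: the outcome is known only after many steps
- **Noisy rewards**: passing tests does not guarantee high code quality

Mitigations: process rewards (scoring each step), test coverage as an auxiliary reward, and human annotation of intermediate steps.

**4. MCP and code**

MCP lets a code Agent discover and use tools dynamically, without hard-coding an API call for each tool. Advantages:
- Pluggable tools (adding a new tool requires no change to the Agent code)
- Standardized interface (all tools are called the same way)
- Runtime discovery (the Agent learns which tools are available automatically)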
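
The regex-based handling of sensitive information described in answer 1 can be sketched as below. The two rules are illustrative assumptions only: one matches the `AKIA` prefix format of AWS access key IDs, the other matches quoted values assigned to names such as `api_key` or `password`. Real secret scanners use far larger rule sets plus entropy checks.

```python
import re

# (pattern, replacement) pairs; illustrative rules, not a complete scanner
RULES = [
    # AWS access key IDs: "AKIA" followed by 16 uppercase letters or digits
    (re.compile(r"\bAKIA[0-9A-Z]{16}\b"), "<AWS_ACCESS_KEY>"),
    # Quoted values of 8+ characters assigned to suspicious names
    (
        re.compile(r"""(?i)(api_?key|secret|token|password)(\s*[:=]\s*)(["'])[^"'\n]{8,}\3"""),
        r"\1\2\3<REDACTED>\3",
    ),
]


def redact_secrets(code):
    """Replace likely secrets with placeholders; returns (clean_code, hit_count)."""
    total = 0
    for pattern, replacement in RULES:
        code, count = pattern.subn(replacement, code)
        total += count
    return code, total


snippet = '''
API_KEY = "example-not-a-real-key-123"
aws_key = "AKIAABCDEFGHIJKLMNOP"
timeout = 30
'''
clean, hits = redact_secrets(snippet)
```

Replacing the value with a placeholder, rather than deleting the whole line, keeps the surrounding code syntactically valid, which matters if the sample is later checked for executability.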

Further Reading¶

Last updated: 2026-04-21
Applicable version: LLM 学习教程 (LLM Learning Tutorial) v2026