
03 - Inference Service Deployment (Comprehensive Edition)

⚠️ Freshness note: this chapter touches on cutting-edge models, pricing, leaderboards, and similar information that can change quickly across versions; always defer to the original papers, official release pages, and API documentation.

📌 Positioning note: this chapter focuses on deployment architecture and an in-depth comparison of inference frameworks. - 📖 For a quick deployment guide aimed at application developers, see LLM应用/11-大模型部署 - 📖 For end-to-end MLOps deployment engineering, see MLOps与AI工程化/02-模型部署与服务化

Learning objectives: master deployment of LLM inference services, including vLLM, TGI, quantized deployment, and API service wrapping.


Table of Contents

  1. Inference Deployment Overview
  2. vLLM Deployment in Practice
  3. Text Generation Inference (TGI)
  4. Quantized Deployment
  5. API Service Wrapping
  6. Performance Optimization and Monitoring

Inference Deployment Overview

1.1 Inference vs. Training

Text Only
Training vs. Inference

Training:
├── Goal: update model parameters
├── Compute: forward + backward pass
├── Memory: parameters, gradients, optimizer states
├── Batching: large batches, optimized for throughput
└── Precision: FP32/BF16/FP16

Inference:
├── Goal: generate predictions
├── Compute: forward pass only
├── Memory: model weights + KV Cache (estimated in the sketch below)
├── Batching: dynamic batches, optimized for latency
└── Precision: can be quantized to INT8/INT4

Core goals of inference optimization:
├── Low latency: fast responses
├── High throughput: serve more requests
├── Low cost: use fewer resources
└── High availability: stable service
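
The KV Cache item above is where most inference memory beyond the weights goes, and it grows linearly with sequence length and batch size. A quick back-of-the-envelope estimate (a sketch; the default shape below matches Llama-2-7B's published config of 32 layers, 32 attention heads, head dim 128):

Python
# Rough KV cache size: 2 (K and V) x layers x kv_heads x head_dim x seq_len x batch x bytes/elem
def kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128,
                   seq_len=4096, batch=1, bytes_per_elem=2):  # FP16 = 2 bytes
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# One 4K-token sequence on a Llama-2-7B-shaped model: ~2.0 GiB of cache
print(f"{kv_cache_bytes() / 1024**3:.2f} GiB")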

1.2 Choosing an Inference Architecture

Text Only
Inference deployment options

1. Local deployment
├── Best for: development, testing, personal use
├── Tools: Transformers, llama.cpp
└── Traits: easy to use, limited resources

2. Server deployment
├── Best for: production, API services
├── Tools: vLLM, TGI, TensorRT-LLM
└── Traits: high performance, concurrent requests

3. Cloud deployment
├── Best for: elastic scaling, large-scale services
├── Platforms: AWS SageMaker, Azure ML, GCP Vertex AI
└── Traits: managed service, autoscaling

4. Edge deployment
├── Best for: mobile devices, embedded systems
├── Tools: MLC-LLM, Qualcomm AI Stack
└── Traits: aggressive quantization, low power

vLLM Deployment in Practice

2.1 vLLM Core Features

Text Only
vLLM's core strengths:

1. PagedAttention (toy sketch after this list)
   ├── Manages the KV cache in pages
   ├── Reduces memory fragmentation
   └── Improves memory utilization

2. Continuous batching
   ├── New requests join the batch dynamically
   ├── Resources are freed as soon as a request finishes
   └── Raises throughput

3. Quantization support
   ├── GPTQ, AWQ, SqueezeLLM
   └── Lowers memory footprint

4. Tensor parallelism
   ├── Multi-GPU inference
   └── Enables large models
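
To make the paging idea concrete, here is a toy allocator (an illustration only, not vLLM's internals): each sequence owns a block table that maps into a shared pool of fixed-size physical blocks, so finishing a request returns whole blocks to the pool instead of leaving fragmented holes.

Python
# Toy paged KV cache allocator (illustration only)
BLOCK_SIZE = 16  # tokens per block

class PagedKVPool:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))  # shared pool of physical blocks
        self.block_tables = {}                      # seq_id -> list of block ids

    def append_token(self, seq_id, pos):
        # A new block is allocated only when a sequence crosses a block boundary
        if pos % BLOCK_SIZE == 0:
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())

    def free(self, seq_id):
        # A finished request hands back whole blocks: no fragmentation
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))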

2.2 vLLM Installation and Basic Usage

Bash
# Install vLLM
pip install vllm

# For CUDA 11.8
pip install vllm --extra-index-url https://download.pytorch.org/whl/cu118

# For CUDA 12.1
pip install vllm --extra-index-url https://download.pytorch.org/whl/cu121
Python
# Basic inference example
from vllm import LLM, SamplingParams

# Load the model
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    tensor_parallel_size=1,  # single GPU
    gpu_memory_utilization=0.9,  # fraction of GPU memory vLLM may use
)

# Sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=256,
)

# Generate
prompts = [
    "The future of AI is",
    "In the beginning,",
    "Once upon a time,",
]

outputs = llm.generate(prompts, sampling_params)

# Print results
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}")
    print(f"Generated: {generated_text!r}")
    print("-" * 50)

2.3 Advanced vLLM Configuration

Python
from vllm import LLM, SamplingParams

class VLLMDeployment:
    """
    Advanced vLLM deployment wrapper
    """

    def __init__(self, model_path, config=None):
        self.config = config or {}

        # Initialize the LLM engine
        self.llm = LLM(
            model=model_path,

            # Parallelism
            tensor_parallel_size=self.config.get('tensor_parallel_size', 1),
            pipeline_parallel_size=self.config.get('pipeline_parallel_size', 1),

            # Memory
            gpu_memory_utilization=self.config.get('gpu_memory_utilization', 0.9),
            max_num_seqs=self.config.get('max_num_seqs', 256),
            max_model_len=self.config.get('max_model_len', 4096),

            # Quantization
            quantization=self.config.get('quantization', None),
            # Options: 'awq', 'gptq', 'squeezellm'

            # Misc
            dtype=self.config.get('dtype', 'auto'),
            # Options: 'float16', 'bfloat16', 'float32'

            trust_remote_code=self.config.get('trust_remote_code', True),
        )

    def generate(self, prompts, **sampling_kwargs):
        """
        Batch generation
        """
        sampling_params = SamplingParams(
            temperature=sampling_kwargs.get('temperature', 0.7),
            top_p=sampling_kwargs.get('top_p', 0.9),
            top_k=sampling_kwargs.get('top_k', -1),
            max_tokens=sampling_kwargs.get('max_tokens', 256),
            presence_penalty=sampling_kwargs.get('presence_penalty', 0.0),
            frequency_penalty=sampling_kwargs.get('frequency_penalty', 0.0),
            stop=sampling_kwargs.get('stop', None),
        )

        outputs = self.llm.generate(prompts, sampling_params)

        # Format outputs
        results = []
        for output in outputs:
            results.append({
                'prompt': output.prompt,
                'text': output.outputs[0].text,
                'tokens': len(output.outputs[0].token_ids),
            })

        return results

    def chat(self, messages, **kwargs):  # **kwargs collects extra keyword arguments
        """
        Chat mode
        """
        # Build the prompt
        prompt = self._build_chat_prompt(messages)

        # Generate
        result = self.generate([prompt], **kwargs)

        return result[0]['text']

    def _build_chat_prompt(self, messages):
        """
        Build a chat prompt
        """
        # Prefer the model's own chat template when available
        if hasattr(self.llm.get_tokenizer(), 'apply_chat_template'):  # hasattr checks whether the object has the attribute
            prompt = self.llm.get_tokenizer().apply_chat_template(
                messages,
                tokenize=False,
                add_generation_prompt=True
            )
        else:
            # Fall back to a simple role: content format
            prompt = ""
            for msg in messages:
                role = msg['role']
                content = msg['content']
                prompt += f"{role}: {content}\n"
            prompt += "assistant:"

        return prompt

# Example config
VLLM_CONFIG = {
    'tensor_parallel_size': 2,  # 2-way tensor parallelism
    'gpu_memory_utilization': 0.85,
    'max_num_seqs': 128,
    'max_model_len': 8192,
    'quantization': None,
    'dtype': 'bfloat16',
}

# Usage
deployment = VLLMDeployment("meta-llama/Llama-2-7b-hf", VLLM_CONFIG)

# Batch generation
results = deployment.generate(
    ["Hello, how are you?", "What is machine learning?"],
    temperature=0.7,
    max_tokens=100
)

# Chat mode
response = deployment.chat([
    {'role': 'user', 'content': 'Explain quantum computing'}
])

2.4 Serving with vLLM

Python
# Launch the vLLM OpenAI-compatible API server
# From the command line (vLLM 0.6+ recommends `vllm serve`)
"""
vllm serve meta-llama/Llama-2-7b-hf \
    --tensor-parallel-size 2 \
    --gpu-memory-utilization 0.85 \
    --max-num-seqs 256 \
    --port 8000

# Older versions can use:
# python -m vllm.entrypoints.openai.api_server \
#     --model meta-llama/Llama-2-7b-hf ...
"""

# A custom server in Python
import time
import uuid
from typing import List, Optional

import uvicorn
from fastapi import FastAPI
from pydantic import BaseModel
from vllm import LLM, SamplingParams

app = FastAPI()

# Request schemas (fuller versions appear in section 5.1)
class CompletionRequest(BaseModel):
    model: str
    prompt: str
    max_tokens: Optional[int] = 256
    temperature: Optional[float] = 0.7
    top_p: Optional[float] = 0.9

class ChatCompletionRequest(BaseModel):
    model: str
    messages: List[dict]
    max_tokens: Optional[int] = 256
    temperature: Optional[float] = 0.7

# Global engine instance
llm_engine = None

@app.on_event("startup")
async def startup_event():  # async def defines a coroutine; newer FastAPI versions prefer lifespan handlers
    global llm_engine
    llm_engine = LLM(
        model="meta-llama/Llama-2-7b-hf",
        tensor_parallel_size=2,
    )

@app.post("/v1/completions")
async def create_completion(request: CompletionRequest):
    """
    OpenAI-compatible completions API
    """
    sampling_params = SamplingParams(
        temperature=request.temperature,
        max_tokens=request.max_tokens,
        top_p=request.top_p,
    )

    outputs = llm_engine.generate(request.prompt, sampling_params)

    return {
        "id": "cmpl-" + str(uuid.uuid4()),
        "object": "text_completion",
        "created": int(time.time()),
        "model": request.model,
        "choices": [
            {
                "text": output.outputs[0].text,
                "index": i,
                "logprobs": None,
                "finish_reason": "stop"
            }
            for i, output in enumerate(outputs)  # enumerate yields index and element together
        ]
    }

@app.post("/v1/chat/completions")
async def create_chat_completion(request: ChatCompletionRequest):
    """
    OpenAI-compatible chat completions API
    """
    # Build the prompt (build_chat_prompt is defined in section 5.1)
    prompt = build_chat_prompt(request.messages)

    sampling_params = SamplingParams(
        temperature=request.temperature,
        max_tokens=request.max_tokens,
    )

    outputs = llm_engine.generate(prompt, sampling_params)

    return {
        "id": "chatcmpl-" + str(uuid.uuid4()),
        "object": "chat.completion",
        "created": int(time.time()),
        "model": request.model,
        "choices": [
            {
                "index": i,
                "message": {
                    "role": "assistant",
                    "content": output.outputs[0].text
                },
                "finish_reason": "stop"
            }
            for i, output in enumerate(outputs)
        ]
    }

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
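
Once `vllm serve` (or the custom server above) is listening on port 8000, any OpenAI-compatible client can talk to it. A minimal sketch with the official openai package (the "EMPTY" API key is the usual convention for an unauthenticated local vLLM server):

Python
from openai import OpenAI

# Point the OpenAI client at the local vLLM server
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.completions.create(
    model="meta-llama/Llama-2-7b-hf",
    prompt="The future of AI is",
    max_tokens=64,
)
print(response.choices[0].text)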

Text Generation Inference (TGI)

3.1 TGI Overview

Text Only
TGI (Text Generation Inference)
├── Developer: Hugging Face
├── Features:
│   ├── Production-grade inference server
│   ├── Continuous batching
│   ├── Streaming generation
│   ├── Safetensors support
│   └── Quantization (GPTQ, AWQ)
└── Best for: production deployments

3.2 TGI Installation and Usage

Bash
# Run via Docker (recommended)
docker run --gpus all --shm-size 1g -p 8080:80 \
    -v $(pwd)/data:/data \
    ghcr.io/huggingface/text-generation-inference:1.4 \
    --model-id meta-llama/Llama-2-7b-hf \
    --num-shard 2 \
    --quantize bitsandbytes

# Install the Python client (the server itself normally runs via Docker)
pip install text-generation
Python
# Python client
from text_generation import Client

client = Client("http://localhost:8080")

# Text generation
text = client.generate(
    "The future of AI is",
    max_new_tokens=100,
    temperature=0.7,
    top_p=0.9,
).generated_text

print(text)

# Streaming generation
for response in client.generate_stream(
    "Explain machine learning:",
    max_new_tokens=100
):
    print(response.token.text, end="", flush=True)
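
Besides the dedicated client, TGI also exposes a plain HTTP interface, so any language can call it. A minimal sketch with requests against the standard /generate route:

Python
import requests

# TGI's REST endpoint: POST /generate with inputs plus generation parameters
resp = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "The future of AI is",
        "parameters": {"max_new_tokens": 100, "temperature": 0.7},
    },
)
print(resp.json()["generated_text"])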

3.3 Advanced TGI Configuration

Bash
# Launch with the most commonly tuned flags
docker run --gpus all --shm-size 1g -p 8080:80 \
    -v $(pwd)/data:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id meta-llama/Llama-2-7b-hf \
    --revision main \
    --sharded true \
    --num-shard 2 \
    --quantize bitsandbytes-nf4 \
    --max-input-length 4096 \
    --max-total-tokens 8192 \
    --max-batch-prefill-tokens 16384 \
    --max-batch-total-tokens 32768

# Flag reference:
# --model-id: model ID or local path
# --sharded: whether to shard the model
# --num-shard: number of shards (GPUs)
# --quantize: quantization method (bitsandbytes, bitsandbytes-nf4, bitsandbytes-fp4, gptq, awq, eetq)
# --max-input-length: maximum input length
# --max-total-tokens: maximum total tokens (input + output)
# --max-batch-prefill-tokens: maximum batch tokens during prefill
# --max-batch-total-tokens: maximum batch tokens during generation

3.4 TGI vs. vLLM

Feature                TGI                       vLLM
─────────────────────────────────────────────────────────────────
PagedAttention         ✅                         ✅
Continuous batching    ✅                         ✅
Quantization           GPTQ, AWQ, BnB            GPTQ, AWQ, SqueezeLLM
Tensor parallelism     ✅                         ✅
Pipeline parallelism   ❌                         ✅
Streaming output       ✅                         ✅
OpenAI API             Partially compatible      ✅ Fully compatible
Production readiness   ✅ More mature            ✅ Higher performance
Ease of use            Medium                    High

Quantized Deployment

4.1 Quantization Methods Compared

Text Only
Quantization methods compared
═══════════════════════════════════════════════════════════════════

Method           Precision  Compression  Quality loss  Best for
─────────────────────────────────────────────────────────────────
FP16             16-bit     2x           Negligible    General use
BF16             16-bit     2x           Negligible    Ampere+ GPUs
INT8             8-bit      4x           <1%           General use
INT4 (GPTQ)      4-bit      8x           1-3%          Local deployment
INT4 (AWQ)       4-bit      8x           <1%           Quality-sensitive
NF4 (QLoRA)      4-bit      8x           1-2%          Fine-tuning + inference
FP4              4-bit      8x           2-4%          Experimental

═══════════════════════════════════════════════════════════════════
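
The compression ratios above translate directly into weight memory. A quick estimate for a 7B-parameter model (weights only; the KV cache and activations come on top):

Python
# Approximate weight memory for a 7B model at different precisions
params = 7e9
for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    gib = params * bits / 8 / 1024**3
    print(f"{name}: {gib:.1f} GiB")  # FP16 ~13.0, INT8 ~6.5, INT4 ~3.3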

4.2 GPTQ Quantization

Python
# Quantize with AutoGPTQ
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Quantization config
quantize_config = BaseQuantizeConfig(
    bits=4,          # 4-bit quantization
    group_size=128,  # group size for per-group scales
    desc_act=False,  # act-order: quantize columns by decreasing activation magnitude (better accuracy, slower inference)
)

# Load the FP16 model for quantization
model = AutoGPTQForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantize_config,
)

# Calibration data must be tokenized examples, not raw strings
calib_texts = [
    "auto-gptq is an easy-to-use model quantization library",
    "with user-friendly apis",
    # ... more calibration samples
]
calib_data = [tokenizer(text) for text in calib_texts]

# Run quantization
model.quantize(calib_data)

# Save the quantized model
model.save_quantized("Llama-2-7b-4bit-gptq")

# Load the quantized model for inference
model = AutoGPTQForCausalLM.from_quantized(
    "Llama-2-7b-4bit-gptq",
    device="cuda:0",
)

# Generate
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0]))

4.3 AWQ Quantization

Python
# AWQ quantization (often better quality than GPTQ at the same bit width)
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-2-7b-hf"
quant_path = "Llama-2-7b-awq"

# Quantization config
quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM"
}

# Load model and tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Calibration data: AutoAWQ takes raw strings (or a dataset name)
calib_data = [
    "AWQ is an activation-aware weight quantization method",
] * 8

# Quantize
model.quantize(
    tokenizer,
    quant_config=quant_config,
    calib_data=calib_data,
)

# Save
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

# Load the quantized model
model = AutoAWQForCausalLM.from_quantized(
    quant_path,
    fuse_layers=True,  # fuse layers for faster inference
)

4.4 Quantized Deployment in vLLM

Python
# vLLM supports several quantization formats
from vllm import LLM

# 1. AWQ
llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",
    quantization="awq",
    tensor_parallel_size=1,
)

# 2. GPTQ
llm = LLM(
    model="TheBloke/Llama-2-7B-GPTQ",
    quantization="gptq",
    tensor_parallel_size=1,
)

# 3. SqueezeLLM
llm = LLM(
    model="path/to/squeezellm/model",
    quantization="squeezellm",
)

API Service Wrapping

5.1 FastAPI Wrapper

Python
import time
import uuid
from typing import List, Optional

import uvicorn
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from vllm import LLM, SamplingParams

app = FastAPI(title="LLM Inference API")

# Global model instance
llm = None

class CompletionRequest(BaseModel):  # Pydantic BaseModel: automatic validation and serialization
    model: str
    prompt: str
    max_tokens: Optional[int] = 256  # Optional means the field may be None
    temperature: Optional[float] = 0.7
    top_p: Optional[float] = 0.9
    top_k: Optional[int] = -1
    stop: Optional[List[str]] = None
    stream: Optional[bool] = False

class ChatCompletionRequest(BaseModel):
    model: str
    messages: List[dict]
    max_tokens: Optional[int] = 256
    temperature: Optional[float] = 0.7
    top_p: Optional[float] = 0.9
    stream: Optional[bool] = False

@app.on_event("startup")
async def load_model():
    global llm
    llm = LLM(
        model="meta-llama/Llama-2-7b-hf",
        tensor_parallel_size=1,
        gpu_memory_utilization=0.9,
    )
    print("Model loaded successfully")

@app.post("/v1/completions")
async def create_completion(request: CompletionRequest):
    try:  # try/except keeps one bad request from crashing the server
        sampling_params = SamplingParams(
            temperature=request.temperature,
            top_p=request.top_p,
            top_k=request.top_k,
            max_tokens=request.max_tokens,
            stop=request.stop,
        )

        outputs = llm.generate(request.prompt, sampling_params)

        return {
            "id": f"cmpl-{uuid.uuid4()}",
            "object": "text_completion",
            "created": int(time.time()),
            "model": request.model,
            "choices": [
                {
                    "text": output.outputs[0].text,
                    "index": i,
                    "logprobs": None,
                    "finish_reason": "stop"
                }
                for i, output in enumerate(outputs)
            ],
            "usage": {
                "prompt_tokens": len(outputs[0].prompt_token_ids),
                "completion_tokens": len(outputs[0].outputs[0].token_ids),
                "total_tokens": len(outputs[0].prompt_token_ids) + len(outputs[0].outputs[0].token_ids)
            }
        }
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.post("/v1/chat/completions")
async def create_chat_completion(request: ChatCompletionRequest):
    try:
        # Build the chat prompt
        prompt = build_chat_prompt(request.messages)

        sampling_params = SamplingParams(
            temperature=request.temperature,
            top_p=request.top_p,
            max_tokens=request.max_tokens,
        )

        outputs = llm.generate(prompt, sampling_params)

        return {
            "id": f"chatcmpl-{uuid.uuid4()}",
            "object": "chat.completion",
            "created": int(time.time()),
            "model": request.model,
            "choices": [
                {
                    "index": i,
                    "message": {
                        "role": "assistant",
                        "content": output.outputs[0].text
                    },
                    "finish_reason": "stop"
                }
                for i, output in enumerate(outputs)
            ],
            "usage": {
                "prompt_tokens": len(outputs[0].prompt_token_ids),
                "completion_tokens": len(outputs[0].outputs[0].token_ids),
                "total_tokens": len(outputs[0].prompt_token_ids) + len(outputs[0].outputs[0].token_ids)
            }
        }
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

def build_chat_prompt(messages):
    """Build a Llama-2-style chat prompt (simplified; see the template sketch below)"""
    prompt = "[INST] "  # open the first turn even when there is no system message
    for msg in messages:
        role = msg.get('role', '')
        content = msg.get('content', '')
        if role == 'system':
            prompt += f"<<SYS>>\n{content}\n<</SYS>>\n\n"
        elif role == 'user':
            prompt += f"{content} [/INST]"
        elif role == 'assistant':
            prompt += f" {content} </s><s>[INST] "
    return prompt

@app.get("/health")
async def health_check():
    return {"status": "healthy", "model_loaded": llm is not None}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
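
The hand-rolled build_chat_prompt above hard-codes Llama-2's template and will produce wrong prompts for other model families. When the tokenizer ships a chat template, prefer apply_chat_template; a minimal sketch:

Python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain quantum computing"},
]

# Renders the model's own template and appends the assistant prefix
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)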

5.2 Client Usage Example

Python
import requests

class LLMClient:
    """
    LLM API client
    """
    def __init__(self, base_url="http://localhost:8000"):
        self.base_url = base_url

    def complete(self, prompt, **kwargs):
        """
        Text completion
        """
        response = requests.post(
            f"{self.base_url}/v1/completions",
            json={
                "model": "llama-2-7b",
                "prompt": prompt,
                **kwargs
            }
        )
        return response.json()

    def chat(self, messages, **kwargs):
        """
        Chat
        """
        response = requests.post(
            f"{self.base_url}/v1/chat/completions",
            json={
                "model": "llama-2-7b",
                "messages": messages,
                **kwargs
            }
        )
        return response.json()

# Usage
client = LLMClient()

# Text completion
result = client.complete(
    "The future of AI is",
    max_tokens=100,
    temperature=0.7
)
print(result['choices'][0]['text'])

# Chat
result = client.chat([
    {"role": "user", "content": "Explain quantum computing"}
])
print(result['choices'][0]['message']['content'])

Performance Optimization and Monitoring

6.1 Performance Metrics

Python
import time

import numpy as np

class PerformanceMonitor:
    """
    Performance monitor
    """

    def __init__(self):
        self.metrics = {
            'latency': [],     # latency in seconds
            'throughput': [],  # throughput in tokens/second
            'queue_size': [],  # queue depth
        }

    def measure_latency(self, func, *args, **kwargs):  # func is the callable under test; *args/**kwargs pass its arguments through, making this a generic timer
        """
        Measure latency
        """
        start = time.time()
        result = func(*args, **kwargs)  # unpack the collected arguments and forward them unchanged
        end = time.time()

        latency = end - start
        self.metrics['latency'].append(latency)

        return result, latency

    def calculate_throughput(self, num_tokens, latency):
        """
        Compute throughput
        """
        throughput = num_tokens / latency
        self.metrics['throughput'].append(throughput)
        return throughput

    def get_statistics(self):
        """
        Summary statistics
        """
        stats = {}
        for key, values in self.metrics.items():
            if values:
                stats[key] = {
                    'mean': np.mean(values),
                    'median': np.median(values),
                    'p95': np.percentile(values, 95),
                    'p99': np.percentile(values, 99),
                    'min': np.min(values),
                    'max': np.max(values),
                }

        return stats

# Benchmark
def benchmark_inference(llm, prompts, max_tokens=100):
    """
    Inference benchmark for a vLLM LLM instance
    """
    from vllm import SamplingParams

    monitor = PerformanceMonitor()
    sampling_params = SamplingParams(max_tokens=max_tokens)

    for prompt in prompts:
        # Measure latency
        outputs, latency = monitor.measure_latency(
            llm.generate,
            [prompt],
            sampling_params,
        )

        # Compute throughput
        num_tokens = len(outputs[0].outputs[0].token_ids)
        throughput = monitor.calculate_throughput(num_tokens, latency)

        print(f"Prompt: {prompt[:50]}...")
        print(f"Latency: {latency:.3f}s")
        print(f"Throughput: {throughput:.1f} tokens/s")
        print("-" * 50)

    # Print summary
    stats = monitor.get_statistics()
    print("\n=== Benchmark Results ===")
    for metric, values in stats.items():
        print(f"\n{metric.upper()}:")
        for stat, value in values.items():
            print(f"  {stat}: {value:.3f}")

6.2 Optimization Tips

Text Only
Inference optimization tips
═══════════════════════════════════════════════════════════════════

1. Batching
├── Use continuous batching
├── Set max_num_seqs appropriately
└── Tune max_batch_total_tokens

2. Memory
├── Quantize (INT8/INT4)
├── Use a paged KV cache (vLLM)
├── Tune gpu_memory_utilization
└── Use gradient checkpointing (only if training on the same GPUs)

3. Parallelism
├── Tensor parallelism (multi-GPU)
├── Pipeline parallelism (very large models)
└── Data parallelism (multiple instances)

4. Compilation
├── torch.compile (PyTorch 2.0+, sketch below)
├── TensorRT-LLM
└── ONNX Runtime

5. Networking
├── Prefer gRPC over HTTP
├── Enable compression
└── Use a CDN to speed up model downloads

═══════════════════════════════════════════════════════════════════
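
As a concrete example of the compilation bullet, torch.compile can wrap a Hugging Face model directly. A sketch (speedups vary by model and GPU; serving frameworks such as vLLM ship their own optimized kernels, so this mainly helps bare-Transformers deployments):

Python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16, device_map="cuda"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# PyTorch 2.0+: JIT-compile the forward pass; the first call pays the compile cost
model = torch.compile(model)

inputs = tokenizer("Hello", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0]))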

Summary

Deployment Option Selection Guide

Scenario            Recommended                  Why
─────────────────────────────────────────────────────────────────
Local development   Transformers / llama.cpp     Simple and easy to use
Production API      vLLM / TGI                   High performance, stable
Very large models   vLLM + tensor parallelism    Multi-GPU support
Edge devices        llama.cpp / MLC-LLM          Aggressive quantization
Quick prototyping   HuggingFace Inference API    No deployment needed

Deployment Checklist

  • Model format correct (Safetensors recommended)
  • Quantization configured appropriately
  • Batching parameters tuned
  • Memory usage monitored
  • Error handling in place
  • Logging complete
  • Health-check endpoint available
  • Autoscaling policy defined

Last updated: 2026-02-12 · Applies to: LLM Learning Tutorial v2026

Next: continue to 04-对齐技术 (Alignment) to master RLHF, DPO, and other alignment techniques!