03 - 推理服务部署(全面版)¶
⚠️ 时效性说明:本章涉及前沿模型/价格/榜单等信息,可能随版本快速变化;请以论文原文、官方发布页和 API 文档为准。
📌 定位说明:本章侧重部署架构与推理框架深度对比。
- 📖 应用开发者快速部署指南请参考 LLM应用/11-大模型部署
- 📖 MLOps全链路工程化部署请参考 MLOps与AI工程化/02-模型部署与服务化
学习目标:掌握大模型推理服务的部署技术,包括vLLM、TGI、量化部署和API服务封装。
推理部署概述¶
1.1 推理 vs 训练¶
Text Only
训练阶段 vs 推理阶段
训练阶段:
├── 目标:更新模型参数
├── 计算:前向 + 反向传播
├── 内存:存储参数、梯度、优化器状态
├── 批处理:大batch,追求吞吐
└── 精度:FP32/BF16/FP16
推理阶段:
├── 目标:生成预测结果
├── 计算:仅前向传播
├── 内存:只需模型参数 + KV Cache(显存估算示例见下文)
├── 批处理:动态batch,追求延迟
└── 精度:可量化到INT8/INT4
推理优化的核心目标:
├── 低延迟(Latency):快速响应
├── 高吞吐(Throughput):处理更多请求
├── 低成本(Cost):减少资源消耗
└── 高可用(Availability):稳定服务
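推理显存主要由模型参数和KV Cache构成,下面给出一个简化的估算草图(假设权重与KV Cache均为FP16/BF16,层数与维度取Llama-2-7B的近似值,忽略激活值与框架开销,数字仅供量级参考):
Python
# 推理显存的简化估算草图(假设:FP16权重 + FP16 KV Cache,忽略激活与框架开销)
def estimate_inference_memory_gb(
    num_params_b=7,       # 参数量(十亿),示例取7B
    num_layers=32,        # Transformer层数(Llama-2-7B约32层)
    hidden_size=4096,     # 隐藏维度(假设K/V维度与隐藏维度相同,即MHA)
    seq_len=4096,         # 上下文长度
    batch_size=8,         # 并发序列数
    bytes_per_elem=2,     # FP16/BF16每元素2字节
):
    # 权重显存 ≈ 参数量 × 每参数字节数
    weight_gb = num_params_b * 1e9 * bytes_per_elem / 1024**3
    # KV Cache ≈ 2(K和V) × 层数 × 序列长度 × 隐藏维度 × 并发数 × 字节数
    kv_cache_gb = 2 * num_layers * seq_len * hidden_size * batch_size * bytes_per_elem / 1024**3
    return weight_gb, kv_cache_gb

weights, kv = estimate_inference_memory_gb()
print(f"权重约 {weights:.1f} GB, KV Cache约 {kv:.1f} GB")  # 约13GB权重 + 16GB KV Cache
可以看到,在较长上下文与较高并发下,KV Cache的占用可能超过权重本身,这也是后文PagedAttention等内存优化的出发点。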
1.2 推理架构选择¶
Text Only
┌─────────────────────────────────────────────────────────────────┐
│ 推理部署架构选择 │
├─────────────────────────────────────────────────────────────────┤
│ │
│ 1. 本地部署(Local Deployment) │
│ ├── 适用:开发测试、个人使用 │
│ ├── 工具:Transformers、llama.cpp │
│ └── 特点:简单易用,资源有限 │
│ │
│ 2. 服务器部署(Server Deployment) │
│ ├── 适用:生产环境、API服务 │
│ ├── 工具:vLLM、TGI、TensorRT-LLM │
│ └── 特点:高性能,支持并发 │
│ │
│ 3. 云端部署(Cloud Deployment) │
│ ├── 适用:弹性伸缩、大规模服务 │
│ ├── 平台:AWS SageMaker、Azure ML、GCP Vertex AI │
│ └── 特点:托管服务,自动扩缩容 │
│ │
│ 4. 边缘部署(Edge Deployment) │
│ ├── 适用:移动设备、嵌入式系统 │
│ ├── 工具:MLC-LLM、Qualcomm AI Stack │
│ └── 特点:极致量化,低功耗 │
│ │
└─────────────────────────────────────────────────────────────────┘
vLLM部署实战¶
2.1 vLLM核心特性¶
Text Only
vLLM核心优势:
1. PagedAttention(分页思想的示意代码见本节末尾)
├── 将KV Cache分页管理
├── 减少内存碎片
└── 提高内存利用率
2. 连续批处理(Continuous Batching)
├── 动态添加新请求
├── 请求完成后立即释放资源
└── 提高吞吐率
3. 量化支持
├── GPTQ、AWQ、SqueezeLLM
└── 降低内存占用
4. 张量并行
├── 多卡推理
└── 支持大模型
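为了直观理解PagedAttention与连续批处理的配合,下面给出一个与vLLM内部实现无关的最小示意(纯Python玩具代码,块大小与命名均为演示假设):KV Cache按固定大小的block分配,序列写满当前block时才申请新block,请求结束后整块立即归还,从而避免为每个序列预留最大长度的连续显存。
Python
# PagedAttention分页思想的最小示意(玩具代码,非vLLM实现)
class ToyKVCachePool:
    def __init__(self, num_blocks=64, block_size=16):
        self.block_size = block_size                 # 每个block容纳的token数
        self.free_blocks = list(range(num_blocks))   # 空闲物理块编号
        self.block_tables = {}                       # 序列ID -> 物理块列表(逻辑连续,物理可不连续)
        self.seq_lens = {}                           # 序列ID -> 当前token数

    def append_token(self, seq_id):
        """为序列追加一个token:仅在当前block写满时才分配新block"""
        n = self.seq_lens.get(seq_id, 0)
        if n % self.block_size == 0:                 # 当前block已满(或还没有block)
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1

    def free_sequence(self, seq_id):
        """请求完成后整块归还,空出的block可立即分配给新请求(配合连续批处理)"""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

pool = ToyKVCachePool()
for _ in range(40):                                  # 序列0共生成40个token
    pool.append_token(seq_id=0)
print(len(pool.block_tables[0]), "blocks used")      # 只占用3个block(40/16向上取整)
pool.free_sequence(0)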
2.2 vLLM安装与基础使用¶
Bash
# 安装vLLM
pip install vllm
# 对于CUDA 11.8
pip install vllm --extra-index-url https://download.pytorch.org/whl/cu118
# 对于CUDA 12.1
pip install vllm --extra-index-url https://download.pytorch.org/whl/cu121
Python
# 基础推理示例
from vllm import LLM, SamplingParams
# 加载模型
llm = LLM(
model="meta-llama/Llama-2-7b-hf",
tensor_parallel_size=1, # 单卡
gpu_memory_utilization=0.9, # GPU内存使用率
)
# 设置采样参数
sampling_params = SamplingParams(
temperature=0.7,
top_p=0.9,
max_tokens=256,
)
# 生成
prompts = [
"The future of AI is",
"In the beginning,",
"Once upon a time,",
]
outputs = llm.generate(prompts, sampling_params)
# 打印结果
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}")
print(f"Generated: {generated_text!r}")
print("-" * 50)
2.3 vLLM高级配置¶
Python
from vllm import LLM, SamplingParams
class VLLMDeployment:
"""
vLLM高级部署配置
"""
def __init__(self, model_path, config=None):
self.config = config or {}
# 初始化LLM
self.llm = LLM(
model=model_path,
# 并行配置
tensor_parallel_size=self.config.get('tensor_parallel_size', 1),
pipeline_parallel_size=self.config.get('pipeline_parallel_size', 1),
# 内存配置
gpu_memory_utilization=self.config.get('gpu_memory_utilization', 0.9),
max_num_seqs=self.config.get('max_num_seqs', 256),
max_model_len=self.config.get('max_model_len', 4096),
# 量化配置
quantization=self.config.get('quantization', None),
# 可选: 'awq', 'gptq', 'squeezellm'
# 其他配置
dtype=self.config.get('dtype', 'auto'),
# 可选: 'float16', 'bfloat16', 'float32'
trust_remote_code=self.config.get('trust_remote_code', True),
)
def generate(self, prompts, **sampling_kwargs):
"""
批量生成
"""
sampling_params = SamplingParams(
temperature=sampling_kwargs.get('temperature', 0.7),
top_p=sampling_kwargs.get('top_p', 0.9),
top_k=sampling_kwargs.get('top_k', -1),
max_tokens=sampling_kwargs.get('max_tokens', 256),
presence_penalty=sampling_kwargs.get('presence_penalty', 0.0),
frequency_penalty=sampling_kwargs.get('frequency_penalty', 0.0),
stop=sampling_kwargs.get('stop', None),
)
outputs = self.llm.generate(prompts, sampling_params)
# 格式化输出
results = []
for output in outputs:
results.append({
'prompt': output.prompt,
'text': output.outputs[0].text,
'tokens': len(output.outputs[0].token_ids),
})
return results
def chat(self, messages, **kwargs): # **kwargs收集关键字参数
"""
对话模式
"""
# 构建prompt
prompt = self._build_chat_prompt(messages)
# 生成
result = self.generate([prompt], **kwargs)
return result[0]['text']
def _build_chat_prompt(self, messages):
"""
构建对话prompt
"""
# 使用模型特定的chat template
if hasattr(self.llm.get_tokenizer(), 'apply_chat_template'): # hasattr检查对象是否有某属性
prompt = self.llm.get_tokenizer().apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
else:
# 简单的对话格式
prompt = ""
for msg in messages:
role = msg['role']
content = msg['content']
prompt += f"{role}: {content}\n"
prompt += "assistant:"
return prompt
# 配置示例
VLLM_CONFIG = {
'tensor_parallel_size': 2, # 2卡并行
'gpu_memory_utilization': 0.85,
'max_num_seqs': 128,
'max_model_len': 8192,
'quantization': None,
'dtype': 'bfloat16',
}
# 使用示例
deployment = VLLMDeployment("meta-llama/Llama-2-7b-hf", VLLM_CONFIG)
# 批量生成
results = deployment.generate(
["Hello, how are you?", "What is machine learning?"],
temperature=0.7,
max_tokens=100
)
# 对话模式
response = deployment.chat([
{'role': 'user', 'content': 'Explain quantum computing'}
])
2.4 vLLM服务部署¶
Python
# 启动vLLM OpenAI兼容API服务
# 命令行方式(vLLM 0.6+推荐使用 vllm serve)
"""
vllm serve meta-llama/Llama-2-7b-hf \
--tensor-parallel-size 2 \
--gpu-memory-utilization 0.85 \
--max-num-seqs 256 \
--port 8000
# 旧版本也可使用:
# python -m vllm.entrypoints.openai.api_server \
# --model meta-llama/Llama-2-7b-hf ...
"""
# 使用Python自定义服务(CompletionRequest、ChatCompletionRequest与build_chat_prompt的定义见5.1节)
import time
import uuid

import uvicorn
from fastapi import FastAPI
from vllm import LLM, SamplingParams

app = FastAPI()
# 全局模型实例
llm_engine = None
@app.on_event("startup")
async def startup_event(): # async def定义协程函数
global llm_engine
llm_engine = LLM(
model="meta-llama/Llama-2-7b-hf",
tensor_parallel_size=2,
)
@app.post("/v1/completions")
async def create_completion(request: CompletionRequest):
"""
OpenAI兼容的completions API
"""
sampling_params = SamplingParams(
temperature=request.temperature,
max_tokens=request.max_tokens,
top_p=request.top_p,
)
outputs = llm_engine.generate(request.prompt, sampling_params)
return {
"id": "cmpl-" + str(uuid.uuid4()),
"object": "text_completion",
"created": int(time.time()),
"model": request.model,
"choices": [
{
"text": output.outputs[0].text,
"index": i,
"logprobs": None,
"finish_reason": "stop"
}
for i, output in enumerate(outputs) # enumerate同时获取索引和元素
]
}
@app.post("/v1/chat/completions")
async def create_chat_completion(request: ChatCompletionRequest):
"""
OpenAI兼容的chat completions API
"""
# 构建prompt
prompt = build_chat_prompt(request.messages)
sampling_params = SamplingParams(
temperature=request.temperature,
max_tokens=request.max_tokens,
)
outputs = llm_engine.generate(prompt, sampling_params)
return {
"id": "chatcmpl-" + str(uuid.uuid4()),
"object": "chat.completion",
"created": int(time.time()),
"model": request.model,
"choices": [
{
"index": i,
"message": {
"role": "assistant",
"content": output.outputs[0].text
},
"finish_reason": "stop"
}
for i, output in enumerate(outputs)
]
}
if __name__ == "__main__":
uvicorn.run(app, host="0.0.0.0", port=8000)
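服务启动后,可以用任意OpenAI兼容客户端调用。下面是一个使用openai官方Python SDK(1.x)访问本地vLLM服务的示例草图,其中base_url、端口与模型名均为示例假设,请按实际部署修改:
Python
# 通过OpenAI兼容接口调用本地vLLM服务(示例草图)
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # vllm serve默认监听8000端口
    api_key="EMPTY",                      # 本地服务一般不校验key,占位即可
)

# Chat Completions调用
response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-hf",
    messages=[{"role": "user", "content": "用一句话解释什么是KV Cache"}],
    max_tokens=128,
    temperature=0.7,
)
print(response.choices[0].message.content)

# 流式输出
stream = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-hf",
    messages=[{"role": "user", "content": "简单介绍一下vLLM"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)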
Text Generation Inference (TGI)¶
3.1 TGI简介¶
Text Only
TGI (Text Generation Inference)
├── 开发者:Hugging Face
├── 特点:
│ ├── 生产级推理服务器
│ ├── 支持连续批处理
│ ├── 支持流式生成
│ ├── 支持Safetensors格式
│ └── 支持量化(GPTQ、AWQ)
└── 适用:生产环境部署
3.2 TGI安装与使用¶
Bash
# Docker方式安装(推荐)
docker run --gpus all --shm-size 1g -p 8080:80 \
-v $(pwd)/data:/data \
ghcr.io/huggingface/text-generation-inference:1.4 \
--model-id meta-llama/Llama-2-7b-hf \
--num-shard 2 \
--quantize bitsandbytes
# 安装Python客户端(text-generation是TGI的客户端库;服务端推荐直接使用上面的Docker镜像)
pip install text-generation
Python
# Python客户端
from text_generation import Client
client = Client("http://localhost:8080")
# 文本生成
text = client.generate(
"The future of AI is",
max_new_tokens=100,
temperature=0.7,
top_p=0.9,
).generated_text
print(text)
# 流式生成
for response in client.generate_stream(
"Explain machine learning:",
max_new_tokens=100
):
print(response.token.text, end="", flush=True)
3.3 TGI高级配置¶
Bash
# 启动参数说明
docker run --gpus all --shm-size 1g -p 8080:80 \
-v $(pwd)/data:/data \
ghcr.io/huggingface/text-generation-inference:latest \
--model-id meta-llama/Llama-2-7b-hf \
--revision main \
--sharded true \
--num-shard 2 \
--quantize bitsandbytes-nf4 \
--max-input-length 4096 \
--max-total-tokens 8192 \
--max-batch-prefill-tokens 16384 \
--max-batch-total-tokens 32768
# 参数说明:
# --model-id: 模型ID或本地路径
# --sharded: 是否使用模型分片
# --num-shard: 分片数量(GPU数量)
# --quantize: 量化方式 (bitsandbytes, bitsandbytes-nf4, bitsandbytes-fp4, gptq, awq, eetq)
# --max-input-length: 最大输入长度
# --max-total-tokens: 最大总token数(输入+输出)
# --max-batch-prefill-tokens: 预填充阶段最大batch token数
# --max-batch-total-tokens: 生成阶段最大batch token数
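除了text-generation客户端,也可以直接调用TGI暴露的REST接口(/generate与/generate_stream)。下面是一个使用requests的调用草图,端口沿用上面Docker命令映射的8080,参数仅作示例:
Python
# 直接调用TGI的REST接口(示例草图,地址与参数按实际部署修改)
import json
import requests

# 一次性生成
resp = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "The future of AI is",
        "parameters": {"max_new_tokens": 100, "temperature": 0.7, "top_p": 0.9},
    },
    timeout=60,
)
print(resp.json()["generated_text"])

# 流式生成:服务端以SSE形式逐token返回(每行形如 data: {...})
with requests.post(
    "http://localhost:8080/generate_stream",
    json={"inputs": "Explain machine learning:", "parameters": {"max_new_tokens": 100}},
    stream=True,
    timeout=60,
) as r:
    for line in r.iter_lines():
        if line.startswith(b"data:"):
            token = json.loads(line[len(b"data:"):])["token"]["text"]
            print(token, end="", flush=True)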
3.4 TGI与vLLM对比¶
| 特性 | TGI | vLLM |
|---|---|---|
| PagedAttention | ✅ | ✅ |
| 连续批处理 | ✅ | ✅ |
| 量化支持 | GPTQ, AWQ, BnB | GPTQ, AWQ, SqueezeLLM |
| 张量并行 | ✅ | ✅ |
| 流水线并行 | ❌ | ✅ |
| 流式输出 | ✅ | ✅ |
| OpenAI API | 部分兼容 | ✅ 完整兼容 |
| 生产就绪 | ✅ 更成熟 | ✅ 高性能 |
| 易用性 | 中等 | 高 |
量化部署¶
4.1 量化方法对比¶
| 方法 | 精度 | 压缩比(相对FP32) | 性能损失 | 适用场景 |
|---|---|---|---|---|
| FP16 | 16bit | 2x | 无 | 通用 |
| BF16 | 16bit | 2x | 无 | Ampere+ GPU |
| INT8 | 8bit | 4x | <1% | 通用 |
| INT4 (GPTQ) | 4bit | 8x | 1-3% | 本地部署 |
| INT4 (AWQ) | 4bit | 8x | <1% | 高质量需求 |
| NF4 (QLoRA) | 4bit | 8x | 1-2% | 微调+推理 |
| FP4 | 4bit | 8x | 2-4% | 实验性 |
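上表的压缩比可以按"权重显存 ≈ 参数量 × 位宽 / 8"粗略验证。下面是一个简单的估算草图(忽略量化带来的scale、zero-point等少量额外开销):
Python
# 不同位宽下模型权重显存的粗略估算(忽略scale/zero-point等额外开销)
def weight_memory_gb(num_params_b, bits):
    return num_params_b * 1e9 * bits / 8 / 1024**3

for params_b in (7, 13, 70):
    row = {bits: round(weight_memory_gb(params_b, bits), 1) for bits in (16, 8, 4)}
    print(f"{params_b}B -> FP16: {row[16]}GB, INT8: {row[8]}GB, INT4: {row[4]}GB")
# 7B  -> FP16: 13.0GB, INT8: 6.5GB, INT4: 3.3GB
# 70B -> FP16: 130.4GB, INT8: 65.2GB, INT4: 32.6GB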
4.2 GPTQ量化部署¶
Python
# 使用AutoGPTQ进行量化
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
# 量化配置
quantize_config = BaseQuantizeConfig(
bits=4, # 4-bit量化
group_size=128, # 分组大小
desc_act=False, # act-order:是否按激活值大小降序量化权重列(True通常精度更高但更慢)
)
# 加载并量化模型
model = AutoGPTQForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
quantize_config,
)
# 准备校准数据(AutoGPTQ要求传入tokenize后的样本)
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
calib_data = [
    tokenizer("auto-gptq is an easy-to-use model quantization library"),
    tokenizer("with user-friendly apis"),
    # ... 更多校准数据
]
# 执行量化
model.quantize(calib_data)
# 保存量化模型
model.save_quantized("Llama-2-7b-4bit-gptq")
# 加载量化模型进行推理(复用上面的tokenizer)
model = AutoGPTQForCausalLM.from_quantized(
"Llama-2-7b-4bit-gptq",
device="cuda:0",
)
# 生成
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0]))
4.3 AWQ量化部署¶
Python
# AWQ量化(通常比GPTQ质量更好)
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
# 加载模型
model_path = "meta-llama/Llama-2-7b-hf"
quant_path = "Llama-2-7b-awq"
# 量化配置
quant_config = {
"zero_point": True,
"q_group_size": 128,
"w_bit": 4,
"version": "GEMM"
}
# 加载模型
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
# 准备校准数据(AutoAWQ的calib_data接受文本列表或数据集名称,默认使用"pileval")
calib_texts = [
    "autoawq is an easy-to-use model quantization library with user-friendly apis",
] * 8
# 量化
model.quantize(
    tokenizer,
    quant_config=quant_config,
    calib_data=calib_texts,
)
# 保存
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
# 加载量化模型
model = AutoAWQForCausalLM.from_quantized(
quant_path,
fuse_layers=True, # 融合层以加速
)
4.4 vLLM中的量化部署¶
Python
# vLLM支持多种量化方式
# 1. AWQ量化
llm = LLM(
model="TheBloke/Llama-2-7B-AWQ",
quantization="awq",
tensor_parallel_size=1,
)
# 2. GPTQ量化
llm = LLM(
model="TheBloke/Llama-2-7B-GPTQ",
quantization="gptq",
tensor_parallel_size=1,
)
# 3. SqueezeLLM量化
llm = LLM(
model="path/to/squeezellm/model",
quantization="squeezellm",
)
API服务封装¶
5.1 FastAPI封装¶
Python
import time
import uuid
from typing import List, Optional

import uvicorn
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from vllm import LLM, SamplingParams
app = FastAPI(title="LLM Inference API")
# 全局模型
llm = None
class CompletionRequest(BaseModel): # Pydantic BaseModel:自动数据验证和序列化
model: str
prompt: str
max_tokens: Optional[int] = 256 # Optional表示值可以为None
temperature: Optional[float] = 0.7
top_p: Optional[float] = 0.9
top_k: Optional[int] = -1
stop: Optional[List[str]] = None
stream: Optional[bool] = False
class ChatCompletionRequest(BaseModel):
model: str
messages: List[dict]
max_tokens: Optional[int] = 256
temperature: Optional[float] = 0.7
top_p: Optional[float] = 0.9
stream: Optional[bool] = False
@app.on_event("startup")
async def load_model():
global llm
llm = LLM(
model="meta-llama/Llama-2-7b-hf",
tensor_parallel_size=1,
gpu_memory_utilization=0.9,
)
print("Model loaded successfully")
@app.post("/v1/completions")
async def create_completion(request: CompletionRequest):
try: # try/except捕获异常,防止程序崩溃
sampling_params = SamplingParams(
temperature=request.temperature,
top_p=request.top_p,
top_k=request.top_k,
max_tokens=request.max_tokens,
stop=request.stop,
)
outputs = llm.generate(request.prompt, sampling_params)
return {
"id": f"cmpl-{uuid.uuid4()}",
"object": "text_completion",
"created": int(time.time()),
"model": request.model,
"choices": [
{
"text": output.outputs[0].text,
"index": i,
"logprobs": None,
"finish_reason": "stop"
}
for i, output in enumerate(outputs)
],
"usage": {
"prompt_tokens": len(outputs[0].prompt_token_ids),
"completion_tokens": len(outputs[0].outputs[0].token_ids),
"total_tokens": len(outputs[0].prompt_token_ids) + len(outputs[0].outputs[0].token_ids)
}
}
except Exception as e:
raise HTTPException(status_code=500, detail=str(e))
@app.post("/v1/chat/completions")
async def create_chat_completion(request: ChatCompletionRequest):
try:
# 构建chat prompt
prompt = build_chat_prompt(request.messages)
sampling_params = SamplingParams(
temperature=request.temperature,
top_p=request.top_p,
max_tokens=request.max_tokens,
)
outputs = llm.generate(prompt, sampling_params)
return {
"id": f"chatcmpl-{uuid.uuid4()}",
"object": "chat.completion",
"created": int(time.time()),
"model": request.model,
"choices": [
{
"index": i,
"message": {
"role": "assistant",
"content": output.outputs[0].text
},
"finish_reason": "stop"
}
for i, output in enumerate(outputs)
],
"usage": {
"prompt_tokens": len(outputs[0].prompt_token_ids),
"completion_tokens": len(outputs[0].outputs[0].token_ids),
"total_tokens": len(outputs[0].prompt_token_ids) + len(outputs[0].outputs[0].token_ids)
}
}
except Exception as e:
raise HTTPException(status_code=500, detail=str(e))
def build_chat_prompt(messages):
    """构建chat prompt(简化的Llama-2对话模板;生产环境建议使用tokenizer.apply_chat_template)"""
    prompt = ""
    inst_open = False  # 当前是否存在未闭合的[INST]
    for msg in messages:
        role = msg.get('role', '')
        content = msg.get('content', '')
        if role == 'system':
            prompt += f"[INST] <<SYS>>\n{content}\n<</SYS>>\n\n"
            inst_open = True
        elif role == 'user':
            if not inst_open:  # 没有system消息时需要补上[INST]开头
                prompt += "[INST] "
                inst_open = True
            prompt += f"{content} [/INST]"
            inst_open = False
        elif role == 'assistant':
            prompt += f" {content} </s><s>[INST] "
            inst_open = True
    return prompt
@app.get("/health")
async def health_check():
return {"status": "healthy", "model_loaded": llm is not None}
if __name__ == "__main__":
uvicorn.run(app, host="0.0.0.0", port=8000)
5.2 客户端调用示例¶
Python
import requests
import json
class LLMClient:
"""
LLM API客户端
"""
def __init__(self, base_url="http://localhost:8000"):
self.base_url = base_url
def complete(self, prompt, **kwargs):
"""
文本补全
"""
response = requests.post(
f"{self.base_url}/v1/completions",
json={
"model": "llama-2-7b",
"prompt": prompt,
**kwargs
}
)
return response.json()
def chat(self, messages, **kwargs):
"""
对话
"""
response = requests.post(
f"{self.base_url}/v1/chat/completions",
json={
"model": "llama-2-7b",
"messages": messages,
**kwargs
}
)
return response.json()
# 使用示例
client = LLMClient()
# 文本补全
result = client.complete(
"The future of AI is",
max_tokens=100,
temperature=0.7
)
print(result['choices'][0]['text'])
# 对话
result = client.chat([
{"role": "user", "content": "Explain quantum computing"}
])
print(result['choices'][0]['message']['content'])
性能优化与监控¶
6.1 性能指标¶
Python
class PerformanceMonitor:
"""
性能监控器
"""
def __init__(self):
self.metrics = {
'latency': [], # 延迟(秒)
'throughput': [], # 吞吐量(tokens/秒)
'queue_size': [], # 队列大小
}
def measure_latency(self, func, *args, **kwargs): # func为传入的函数,*args/**kwargs透传任意参数,实现通用性能测量
"""
测量延迟
"""
import time
start = time.time()
result = func(*args, **kwargs) # 将收集的参数解包后原样传给被测量的函数
end = time.time()
latency = end - start
self.metrics['latency'].append(latency)
return result, latency
def calculate_throughput(self, num_tokens, latency):
"""
计算吞吐量
"""
throughput = num_tokens / latency
self.metrics['throughput'].append(throughput)
return throughput
def get_statistics(self):
"""
获取统计信息
"""
import numpy as np
stats = {}
for key, values in self.metrics.items():
if values:
stats[key] = {
'mean': np.mean(values),
'median': np.median(values),
'p95': np.percentile(values, 95),
'p99': np.percentile(values, 99),
'min': np.min(values),
'max': np.max(values),
}
return stats
# 基准测试(假设model为vLLM的LLM实例)
from vllm import SamplingParams

def benchmark_inference(model, prompts, max_tokens=100):
    """
    推理基准测试
    """
    monitor = PerformanceMonitor()
    sampling_params = SamplingParams(max_tokens=max_tokens)
    for prompt in prompts:
        # 测量延迟
        outputs, latency = monitor.measure_latency(
            model.generate,
            [prompt],
            sampling_params,
        )
# 计算吞吐量
num_tokens = len(outputs[0].outputs[0].token_ids)
throughput = monitor.calculate_throughput(num_tokens, latency)
print(f"Prompt: {prompt[:50]}...")
print(f"Latency: {latency:.3f}s")
print(f"Throughput: {throughput:.1f} tokens/s")
print("-" * 50)
# 打印统计
stats = monitor.get_statistics()
print("\n=== Benchmark Results ===")
for metric, values in stats.items():
print(f"\n{metric.upper()}:")
for stat, value in values.items():
print(f" {stat}: {value:.3f}")
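除了单条请求的延迟,线上服务还需要关注并发场景下的整体吞吐。下面是一个对已部署API做简单压测的草图(标准库线程池 + requests;URL、模型名与payload均为示例假设,请按实际服务修改):
Python
# 对已部署的推理API做简单并发压测(示例草图)
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:8000/v1/completions"
PAYLOAD = {"model": "llama-2-7b", "prompt": "The future of AI is", "max_tokens": 64}

def one_request(_):
    start = time.time()
    resp = requests.post(URL, json=PAYLOAD, timeout=120)
    resp.raise_for_status()
    return time.time() - start

def load_test(concurrency=8, total=32):
    start = time.time()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(one_request, range(total)))
    wall = time.time() - start
    p95 = latencies[min(len(latencies) - 1, int(0.95 * len(latencies)))]
    print(f"并发数: {concurrency}, 总请求数: {total}")
    print(f"平均延迟: {sum(latencies) / len(latencies):.2f}s, P95延迟: {p95:.2f}s")
    print(f"整体吞吐: {total / wall:.2f} req/s")

if __name__ == "__main__":
    load_test()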
6.2 优化建议¶
Text Only
推理优化建议
═══════════════════════════════════════════════════════════════════
1. 批处理优化
├── 使用动态批处理(continuous batching)
├── 设置合适的max_num_seqs
└── 调整max_batch_total_tokens
2. 内存优化
├── 使用量化(INT8/INT4)
├── 启用KV Cache分页(vLLM)
├── 调整gpu_memory_utilization
└── 使用梯度检查点(如果同时训练)
3. 并行优化
├── 张量并行(多卡)
├── 流水线并行(超大模型)
└── 数据并行(多实例)
4. 编译优化
├── 使用torch.compile(PyTorch 2.0+,示例见本节末尾)
├── 使用TensorRT-LLM
└── 使用ONNX Runtime
5. 网络优化
├── 使用gRPC代替HTTP
├── 启用压缩
└── 使用CDN加速模型下载
═══════════════════════════════════════════════════════════════════
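作为编译优化的一个最小示例,下面演示用torch.compile包装Hugging Face模型做推理(示意草图;实际收益取决于模型、PyTorch版本与硬件,vLLM/TensorRT-LLM等推理框架内部已有各自的图优化,通常无需再手动编译):
Python
# torch.compile加速HF模型推理的最小示例(PyTorch 2.0+,收益因模型与硬件而异)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
model.eval()

# 编译前向计算图:首次调用会触发编译(较慢),后续调用复用编译结果
model = torch.compile(model)

inputs = tokenizer("The future of AI is", return_tensors="pt").to("cuda")
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))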
总结¶
部署方案选择指南¶
| 场景 | 推荐方案 | 理由 |
|---|---|---|
| 本地开发 | Transformers / llama.cpp | 简单易用 |
| 生产API | vLLM / TGI | 高性能,稳定 |
| 超大模型 | vLLM + 张量并行 | 支持多卡 |
| 边缘设备 | llama.cpp / MLC-LLM | 极致量化 |
| 快速原型 | HuggingFace Inference API | 无需部署 |
部署检查清单¶
- 模型格式正确(Safetensors推荐)
- 量化配置适当
- 批处理参数调优
- 内存使用监控
- 错误处理机制
- 日志记录完善
- 健康检查接口
- 自动扩缩容策略
最后更新日期:2026-02-12 适用版本:LLM学习教程 v2026
下一步:学习04-对齐技术,掌握RLHF和DPO等对齐技术!