
第14章 视觉模型部署实战

⚠️ 时效性说明:本章涉及前沿模型/价格/榜单等信息,可能随版本快速变化;请以论文原文、官方发布页和 API 文档为准。

📚 章节概述

本章系统讲解视觉模型从训练到生产部署的完整流程,覆盖ONNX导出、TensorRT加速、Triton推理服务、移动端部署(NCNN/MNN/TFLite)、以及性能调优最佳实践。

学习时间:5-6天 难度等级:⭐⭐⭐⭐⭐ 前置知识:PyTorch基础、计算机视觉基础

📎 交叉引用: - FlashAttention → 12-FlashAttention原理与实现 - Speculative Decoding → 13-推测解码与推理加速 - C++ TensorRT → C++开发/18-SIMD与AI推理引擎 - CUDA优化 → 底层系统/05-GPU并行计算/06-CUDA高级优化


14.1 部署Pipeline总览

14.1.1 从训练到部署

Text Only
训练 → 导出 → 优化 → 部署 → 监控

┌─────────┐    ┌─────────┐    ┌──────────┐    ┌───────────┐    ┌────────┐
│ PyTorch │ →  │  ONNX   │ →  │ TensorRT │ →  │  Triton   │ →  │ 监控   │
│ 模型    │    │ 中间格式 │    │ 引擎优化 │    │ 推理服务  │    │ Grafana│
└─────────┘    └─────────┘    └──────────┘    └───────────┘    └────────┘
      │              │              │               │
  model.pt       model.onnx    model.engine    HTTP/gRPC API

每个阶段可选工具:
  导出: torch.onnx.export, torch.jit.trace
  优化: TensorRT, OpenVINO, ONNX Runtime
  服务: Triton, TorchServe, BentoML, vLLM
  监控: Prometheus + Grafana

14.1.2 部署目标平台

| 平台 | 推理引擎 | 适用场景 |
|------|----------|----------|
| NVIDIA GPU(数据中心) | TensorRT | 服务器推理,最高性能 |
| NVIDIA Jetson | TensorRT + DeepStream | 边缘推理 |
| Intel CPU | OpenVINO | 无GPU环境 |
| 通用CPU/GPU | ONNX Runtime | 跨平台兼容 |
| Android | NCNN / MNN / TFLite | 移动端 |
| iOS | Core ML / NCNN | Apple生态 |

14.2 PyTorch模型导出ONNX

14.2.1 基础导出

Python
import torch
import torchvision

# 加载模型
# torchvision 0.13+ 推荐使用枚举类型而非字符串
from torchvision.models import ResNet50_Weights
model = torchvision.models.resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)
# 旧版本兼容写法:model = torchvision.models.resnet50(weights="IMAGENET1K_V2")
model.eval()  # 切换到推理模式(固定BatchNorm统计量、关闭Dropout)

# 创建虚拟输入
dummy_input = torch.randn(1, 3, 224, 224)

# 基础导出
torch.onnx.export(
    model,
    dummy_input,
    "resnet50.onnx",
    input_names=["input"],
    output_names=["output"],
    opset_version=17,                    # 使用较新的opset
    dynamic_axes={                       # 动态batch
        "input": {0: "batch_size"},
        "output": {0: "batch_size"}
    },
    do_constant_folding=True,            # 常量折叠优化
)
print("✅ ONNX导出成功")

14.2.2 检测模型导出(YOLOv8)

Python
from ultralytics import YOLO

# 加载YOLOv8模型
model = YOLO("yolov8n.pt")  # n/s/m/l/x

# 导出ONNX(注意:默认不内置NMS后处理,需在推理侧自行实现;
# 较新版本的ultralytics部分格式支持 nms=True,以官方文档为准)
model.export(
    format="onnx",
    imgsz=640,
    dynamic=True,        # 动态batch
    simplify=True,       # onnx-simplifier优化
    opset=17,
    half=False,          # True则导出FP16权重(通常与dynamic=True不兼容)
)

14.2.3 ONNX模型验证与优化

Python
import onnx
import onnxruntime as ort
import numpy as np
import torch

# 1. 模型验证
model = onnx.load("resnet50.onnx")
onnx.checker.check_model(model)
print(f"✅ ONNX模型验证通过")
print(f"  输入: {[i.name for i in model.graph.input]}")
print(f"  输出: {[o.name for o in model.graph.output]}")

# 2. 模型简化(消除冗余节点)
import onnxsim
model_simplified, check = onnxsim.simplify(model)
assert check, "简化失败"  # assert断言
onnx.save(model_simplified, "resnet50_sim.onnx")

# 3. 精度验证:对比PyTorch和ONNX输出
session = ort.InferenceSession("resnet50.onnx")
dummy_np = np.random.randn(1, 3, 224, 224).astype(np.float32)

# ONNX Runtime推理
ort_output = session.run(None, {"input": dummy_np})[0]

# PyTorch推理
import torchvision
pt_model = torchvision.models.resnet50(weights="IMAGENET1K_V2")
pt_model.eval()
with torch.no_grad():  # 禁用梯度计算,节省内存
    pt_output = pt_model(torch.from_numpy(dummy_np)).numpy()

# 对比
max_diff = np.abs(ort_output - pt_output).max()
print(f"  最大精度差异: {max_diff:.6f}")
assert max_diff < 1e-5, "精度验证失败!"
print("✅ 精度验证通过")
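
除最大绝对误差外,也常用余弦相似度衡量两个输出的整体一致性。下面是一个最小示意:用随机数据模拟"参考输出"和"部署后输出",实际使用时替换为上面的 pt_output / ort_output:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """展平后计算两个输出张量的余弦相似度"""
    a, b = a.ravel().astype(np.float64), b.ravel().astype(np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# 示意:模拟FP16部署引入的微小扰动
ref = np.random.randn(1, 1000).astype(np.float32)                      # 参考输出
deployed = ref + np.random.randn(1, 1000).astype(np.float32) * 1e-4    # 部署后输出

sim = cosine_similarity(ref, deployed)
print(f"cosine similarity: {sim:.6f}")
assert sim > 0.999, "精度退化超出阈值"
```

余弦相似度对整体方向敏感、对均匀缩放不敏感,与逐元素最大误差互补,两者一起看更稳妥。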

📝 面试考点:ONNX导出时opset_version的作用?dynamic_axes有什么用?如何验证导出精度?


14.3 TensorRT Python API推理

14.3.1 TensorRT Python构建Engine

Python
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_engine(onnx_path, engine_path, fp16=True, int8=False,
                 max_batch=16, workspace_gb=2):
    """从ONNX构建TensorRT Engine"""
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, TRT_LOGGER)

    # 解析ONNX
    with open(onnx_path, 'rb') as f:  # with自动管理文件关闭
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(f"Error: {parser.get_error(i)}")
            return None

    # 构建配置
    config = builder.create_builder_config()
    config.set_memory_pool_limit(
        trt.MemoryPoolType.WORKSPACE, workspace_gb << 30)

    if fp16:
        config.set_flag(trt.BuilderFlag.FP16)
    if int8:
        config.set_flag(trt.BuilderFlag.INT8)
        # 需要设置校准器(参见INT8校准部分)

    # 动态shape profile
    profile = builder.create_optimization_profile()
    profile.set_shape("input",
                      min=(1, 3, 224, 224),
                      opt=(8, 3, 224, 224),
                      max=(max_batch, 3, 224, 224))
    config.add_optimization_profile(profile)

    # 构建Engine
    print("Building TensorRT engine (this may take a few minutes)...")
    serialized = builder.build_serialized_network(network, config)

    # 保存
    with open(engine_path, 'wb') as f:
        f.write(serialized)

    print(f"✅ Engine saved to {engine_path}")
    return serialized

class TRTInference:
    """
    TensorRT推理封装

    **版本要求**:
    - TensorRT >= 8.6.0(使用 num_io_tensors API)
    - 对于 TensorRT < 8.6,请使用 TRTInferenceLegacy 类

    **API变更说明**:
    - TensorRT 8.6+ 引入了 `num_io_tensors` 和 `get_tensor_name` API
    - 早期版本使用 `num_bindings` 和 `get_binding_name` API
    """

    def __init__(self, engine_path):
        self.logger = trt.Logger(trt.Logger.WARNING)

        # 反序列化Engine
        runtime = trt.Runtime(self.logger)
        with open(engine_path, 'rb') as f:
            self.engine = runtime.deserialize_cuda_engine(f.read())

        self.context = self.engine.create_execution_context()
        self.stream = cuda.Stream()

        # 分析输入输出(TensorRT 8.6+ API)
        self.inputs = []
        self.outputs = []
        self.bindings = {}

        for i in range(self.engine.num_io_tensors):
            name = self.engine.get_tensor_name(i)
            dtype = trt.nptype(self.engine.get_tensor_dtype(name))
            mode = self.engine.get_tensor_mode(name)

            if mode == trt.TensorIOMode.INPUT:
                self.inputs.append({"name": name, "dtype": dtype})
            else:
                self.outputs.append({"name": name, "dtype": dtype})

    def infer(self, input_data: np.ndarray) -> np.ndarray:
        """执行推理(TensorRT 8.6+ API)"""
        # 设置动态shape
        input_name = self.inputs[0]["name"]
        self.context.set_input_shape(input_name, input_data.shape)

        # 分配GPU内存
        d_input = cuda.mem_alloc(input_data.nbytes)

        # 获取输出shape
        output_name = self.outputs[0]["name"]
        output_shape = self.context.get_tensor_shape(output_name)
        output_data = np.empty(tuple(output_shape),
                               dtype=self.outputs[0]["dtype"])
        d_output = cuda.mem_alloc(output_data.nbytes)

        # 拷贝输入到GPU
        cuda.memcpy_htod_async(d_input, input_data.ravel(), self.stream)

        # 设置tensor地址
        self.context.set_tensor_address(input_name, int(d_input))
        self.context.set_tensor_address(output_name, int(d_output))

        # 执行推理
        self.context.execute_async_v3(self.stream.handle)

        # 拷贝输出到CPU
        cuda.memcpy_dtoh_async(output_data, d_output, self.stream)
        self.stream.synchronize()

        # 释放GPU内存
        d_input.free()
        d_output.free()

        return output_data


class TRTInferenceLegacy:
    """
    TensorRT推理封装(兼容 TensorRT 8.5 及更早版本)

    **适用版本**:TensorRT < 8.6.0
    **主要差异**:使用 `num_bindings` 和 `get_binding_name` API
    """

    def __init__(self, engine_path):
        self.logger = trt.Logger(trt.Logger.WARNING)

        # 反序列化Engine
        runtime = trt.Runtime(self.logger)
        with open(engine_path, 'rb') as f:
            self.engine = runtime.deserialize_cuda_engine(f.read())

        self.context = self.engine.create_execution_context()
        self.stream = cuda.Stream()

        # 分析输入输出(TensorRT 8.5 及更早版本 API)
        self.inputs = []
        self.outputs = []
        self.bindings = []

        for i in range(self.engine.num_bindings):
            name = self.engine.get_binding_name(i)
            dtype = trt.nptype(self.engine.get_binding_dtype(i))

            if self.engine.binding_is_input(i):
                self.inputs.append({"name": name, "dtype": dtype, "index": i})
            else:
                self.outputs.append({"name": name, "dtype": dtype, "index": i})

    def infer(self, input_data: np.ndarray) -> np.ndarray:
        """执行推理(兼容旧版API)"""
        # 分配GPU内存
        d_input = cuda.mem_alloc(input_data.nbytes)

        # 获取输出shape
        output_info = self.outputs[0]
        output_shape = self.context.get_binding_shape(output_info["index"])
        output_data = np.empty(output_shape, dtype=output_info["dtype"])
        d_output = cuda.mem_alloc(output_data.nbytes)

        # 设置bindings
        self.bindings = [None] * self.engine.num_bindings
        self.bindings[self.inputs[0]["index"]] = int(d_input)
        self.bindings[output_info["index"]] = int(d_output)

        # 拷贝输入到GPU
        cuda.memcpy_htod_async(d_input, input_data.ravel(), self.stream)

        # 执行推理
        self.context.execute_async_v2(self.bindings, self.stream.handle)

        # 拷贝输出到CPU
        cuda.memcpy_dtoh_async(output_data, d_output, self.stream)
        self.stream.synchronize()

        # 释放GPU内存
        d_input.free()
        d_output.free()

        return output_data


def create_trt_inference(engine_path):
    """
    自动选择合适的TensorRT推理类

    根据TensorRT版本自动选择 TRTInference 或 TRTInferenceLegacy

    Args:
        engine_path: TensorRT引擎文件路径

    Returns:
        TRTInference 或 TRTInferenceLegacy 实例
    """
    trt_version = tuple(map(int, trt.__version__.split('.')))

    if trt_version >= (8, 6, 0):
        print(f"✅ 使用 TensorRT {trt.__version__} (>= 8.6) API")
        return TRTInference(engine_path)
    else:
        print(f"⚠️ 使用 TensorRT {trt.__version__} (< 8.6) 兼容API")
        return TRTInferenceLegacy(engine_path)


# 使用示例
if __name__ == "__main__":
    # 构建Engine
    build_engine("resnet50.onnx", "resnet50_fp16.engine", fp16=True)

    # 推理
    infer = TRTInference("resnet50_fp16.engine")

    # 模拟输入
    input_data = np.random.randn(1, 3, 224, 224).astype(np.float32)
    output = infer.infer(input_data)

    top5 = np.argsort(output[0])[-5:][::-1]
    print(f"Top-5 classes: {top5}")

📝 面试考点:TensorRT Python API和C++ API的区别?什么时候用哪个?


14.4 YOLOv8 TensorRT部署实战

14.4.1 完整部署流程

Python
import cv2
import numpy as np
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit
import time

class YOLOv8TRT:
    """YOLOv8 TensorRT推理"""

    def __init__(self, engine_path, conf_thres=0.5, iou_thres=0.45):
        self.conf_thres = conf_thres
        self.iou_thres = iou_thres

        # 加载Engine
        logger = trt.Logger(trt.Logger.WARNING)
        runtime = trt.Runtime(logger)
        with open(engine_path, 'rb') as f:
            self.engine = runtime.deserialize_cuda_engine(f.read())
        self.context = self.engine.create_execution_context()
        self.stream = cuda.Stream()

    def preprocess(self, img, input_size=640):
        """LetterBox预处理"""
        h, w = img.shape[:2]  # 取前两维:高、宽
        scale = min(input_size / h, input_size / w)
        new_h, new_w = int(h * scale), int(w * scale)

        # resize
        resized = cv2.resize(img, (new_w, new_h))

        # padding
        canvas = np.full((input_size, input_size, 3), 114, dtype=np.uint8)
        dh, dw = (input_size - new_h) // 2, (input_size - new_w) // 2
        canvas[dh:dh + new_h, dw:dw + new_w] = resized

        # 归一化 + HWC→CHW + 增加batch维
        blob = canvas.astype(np.float32) / 255.0
        blob = blob.transpose(2, 0, 1)[np.newaxis, ...]
        blob = np.ascontiguousarray(blob)

        return blob, scale, dw, dh

    def postprocess(self, output, scale, dw, dh, img_shape):
        """后处理: NMS"""
        # YOLOv8输出shape: [1, 84, 8400] (COCO 80类)
        # 转为 [8400, 84]
        predictions = output[0].T

        # 获取置信度最高的类别
        class_scores = predictions[:, 4:]  # [8400, 80]
        max_scores = class_scores.max(axis=1)

        # 过滤低置信度
        mask = max_scores > self.conf_thres
        predictions = predictions[mask]
        max_scores = max_scores[mask]
        class_ids = class_scores[mask].argmax(axis=1)

        if len(predictions) == 0:
            return [], [], []

        # 提取bbox (cx, cy, w, h → x1, y1, x2, y2)
        boxes = predictions[:, :4]
        x1 = boxes[:, 0] - boxes[:, 2] / 2
        y1 = boxes[:, 1] - boxes[:, 3] / 2
        x2 = boxes[:, 0] + boxes[:, 2] / 2
        y2 = boxes[:, 1] + boxes[:, 3] / 2

        # 还原到原图坐标
        x1 = (x1 - dw) / scale
        y1 = (y1 - dh) / scale
        x2 = (x2 - dw) / scale
        y2 = (y2 - dh) / scale

        # clip
        x1 = np.clip(x1, 0, img_shape[1])
        y1 = np.clip(y1, 0, img_shape[0])
        x2 = np.clip(x2, 0, img_shape[1])
        y2 = np.clip(y2, 0, img_shape[0])

        boxes_xyxy = np.stack([x1, y1, x2, y2], axis=1)

        # NMS
        indices = self._nms(boxes_xyxy, max_scores, self.iou_thres)

        return boxes_xyxy[indices], max_scores[indices], class_ids[indices]

    def _nms(self, boxes, scores, iou_threshold):
        """非极大值抑制"""
        indices = scores.argsort()[::-1]
        keep = []

        while len(indices) > 0:
            current = indices[0]
            keep.append(current)

            if len(indices) == 1:
                break

            rest = indices[1:]
            ious = self._compute_iou(boxes[current], boxes[rest])
            indices = rest[ious < iou_threshold]

        return keep

    def _compute_iou(self, box, boxes):
        """计算IoU"""
        x1 = np.maximum(box[0], boxes[:, 0])
        y1 = np.maximum(box[1], boxes[:, 1])
        x2 = np.minimum(box[2], boxes[:, 2])
        y2 = np.minimum(box[3], boxes[:, 3])

        inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
        area_box = (box[2] - box[0]) * (box[3] - box[1])
        area_boxes = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])

        return inter / (area_box + area_boxes - inter + 1e-6)

    def detect(self, img):
        """端到端检测"""
        # 预处理
        blob, scale, dw, dh = self.preprocess(img)

        # TensorRT推理
        input_name = self.engine.get_tensor_name(0)
        output_name = self.engine.get_tensor_name(1)

        self.context.set_input_shape(input_name, blob.shape)

        d_input = cuda.mem_alloc(blob.nbytes)
        output_shape = self.context.get_tensor_shape(output_name)
        output = np.empty(output_shape, dtype=np.float32)
        d_output = cuda.mem_alloc(output.nbytes)

        cuda.memcpy_htod_async(d_input, blob, self.stream)
        self.context.set_tensor_address(input_name, int(d_input))
        self.context.set_tensor_address(output_name, int(d_output))
        self.context.execute_async_v3(self.stream.handle)
        cuda.memcpy_dtoh_async(output, d_output, self.stream)
        self.stream.synchronize()

        d_input.free()
        d_output.free()

        # 后处理
        boxes, scores, class_ids = self.postprocess(
            output, scale, dw, dh, img.shape)

        return boxes, scores, class_ids

# 使用示例
def main():
    detector = YOLOv8TRT("yolov8n.engine")

    img = cv2.imread("test.jpg")

    # 预热
    for _ in range(10):
        detector.detect(img)

    # 性能测试
    N = 100
    start = time.perf_counter()
    for _ in range(N):
        boxes, scores, class_ids = detector.detect(img)
    elapsed = (time.perf_counter() - start) / N
    fps = 1.0 / elapsed

    print(f"Latency: {elapsed*1000:.1f}ms, FPS: {fps:.0f}")
    print(f"Detections: {len(boxes)}")

    # 可视化
    COCO_NAMES = ["person", "bicycle", "car", ...]  # 80类
    for box, score, cls_id in zip(boxes, scores, class_ids):  # zip按位置配对
        x1, y1, x2, y2 = box.astype(int)
        cv2.rectangle(img, (x1, y1), (x2, y2), (0, 255, 0), 2)
        label = f"{COCO_NAMES[cls_id]}: {score:.2f}"
        cv2.putText(img, label, (x1, y1 - 10),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)

    cv2.imwrite("result.jpg", img)
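
LetterBox 的缩放/padding参数与坐标还原公式可以脱离模型单独验证(纯Python示意,与上面 preprocess/postprocess 中的计算一致):

```python
def letterbox_params(h, w, input_size=640):
    """计算LetterBox的缩放比例与左右/上下padding"""
    scale = min(input_size / h, input_size / w)
    new_h, new_w = int(h * scale), int(w * scale)
    dh, dw = (input_size - new_h) // 2, (input_size - new_w) // 2
    return scale, dw, dh

# 480x640 图像:等比缩放后上下各留80px灰边
scale, dw, dh = letterbox_params(480, 640)
print(scale, dw, dh)  # 1.0 0 80

# 模型坐标 → 原图坐标:先减padding,再除以scale
x_model, y_model = 320.0, 400.0
x_orig = (x_model - dw) / scale
y_orig = (y_model - dh) / scale
assert (x_orig, y_orig) == (320.0, 320.0)
```

注意还原顺序必须与预处理相反:预处理是"先缩放再加padding",还原就要"先减padding再除以scale"。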

📝 面试考点:YOLOv8的后处理为什么比YOLOv5更简洁?LetterBox预处理的作用?


14.5 ONNX Runtime多后端部署

14.5.1 各ExecutionProvider对比

Text Only
ONNX Runtime ExecutionProvider:
├── CPUExecutionProvider     默认CPU推理
├── CUDAExecutionProvider    NVIDIA GPU
├── TensorrtExecutionProvider  TensorRT加速(最快)
├── OpenVINOExecutionProvider  Intel CPU/GPU/VPU
├── DirectMLExecutionProvider  Windows GPU通用
├── CoreMLExecutionProvider    Apple Silicon
└── QNNExecutionProvider       Qualcomm NPU

14.5.2 ONNX Runtime推理代码

Python
import onnxruntime as ort
import numpy as np

class ONNXInference:
    """ONNX Runtime多后端推理"""

    def __init__(self, model_path, device="gpu"):
        # 根据设备选择ExecutionProvider
        providers = self._get_providers(device)

        # Session选项
        sess_options = ort.SessionOptions()
        sess_options.graph_optimization_level = (
            ort.GraphOptimizationLevel.ORT_ENABLE_ALL)
        sess_options.intra_op_num_threads = 4
        sess_options.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL

        self.session = ort.InferenceSession(
            model_path, sess_options, providers=providers)

        # 获取输入输出信息
        self.input_name = self.session.get_inputs()[0].name
        self.output_names = [o.name for o in self.session.get_outputs()]

        print(f"Using: {self.session.get_providers()}")

    def _get_providers(self, device):
        if device == "gpu":
            return [
                ("TensorrtExecutionProvider", {
                    "trt_max_workspace_size": 2 << 30,
                    "trt_fp16_enable": True,
                    "trt_engine_cache_enable": True,
                    "trt_engine_cache_path": "./trt_cache",
                }),
                ("CUDAExecutionProvider", {
                    "device_id": 0,
                    "arena_extend_strategy": "kNextPowerOfTwo",
                }),
                "CPUExecutionProvider",
            ]
        elif device == "openvino":
            return [
                ("OpenVINOExecutionProvider", {
                    "device_type": "CPU_FP32",
                }),
                "CPUExecutionProvider",
            ]
        else:
            return ["CPUExecutionProvider"]

    def infer(self, input_data):
        return self.session.run(
            self.output_names, {self.input_name: input_data})

# 使用
infer = ONNXInference("resnet50.onnx", device="gpu")
output = infer.infer(np.random.randn(1, 3, 224, 224).astype(np.float32))
print(f"Output shape: {output[0].shape}")

📝 面试考点:ONNX Runtime与TensorRT直接调用有什么区别?什么时候选用ONNX Runtime?


14.6 Triton Inference Server

14.6.1 Triton架构

Text Only
Triton Inference Server (NVIDIA):
┌──────────────────────────────────────────┐
│  HTTP/gRPC Client                        │
└────────────────┬─────────────────────────┘
┌──────────────────────────────────────────┐
│  Triton Server                           │
│  ├── Model Repository                    │
│  │   ├── resnet50/                       │
│  │   │   ├── config.pbtxt                │
│  │   │   └── 1/model.plan               │
│  │   ├── yolov8/                         │
│  │   │   ├── config.pbtxt                │
│  │   │   └── 1/model.onnx               │
│  ├── Scheduler                           │
│  │   ├── Dynamic Batching               │
│  │   └── Sequence Batching               │
│  ├── Backend                             │
│  │   ├── TensorRT  ───→  GPU             │
│  │   ├── ONNX Runtime ─→ CPU/GPU         │
│  │   └── Python ────→ Custom logic       │
│  └── Metrics (Prometheus)                │
└──────────────────────────────────────────┘

14.6.2 模型仓库配置

Protocol Buffer
# model_repository/resnet50/config.pbtxt
name: "resnet50"
platform: "tensorrt_plan"
max_batch_size: 32

input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]

output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]

# 动态批处理配置
dynamic_batching {
  preferred_batch_size: [ 4, 8, 16 ]
  max_queue_delay_microseconds: 100
}

# 多实例配置(Multi-GPU)
instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0, 1 ]
  }
]
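
dynamic_batching 的收益取决于负载:可以用一个简化的到达模型粗略估算批处理窗口内能凑出的batch大小(仅为示意算术,并非Triton内部实现):

```python
def expected_batch(qps: float, max_delay_us: float,
                   preferred: int, max_batch: int) -> int:
    """简化模型:请求均匀到达时,窗口内可收集的请求数(含触发请求本身)"""
    arrivals = qps * (max_delay_us / 1e6)   # 窗口内新到达的请求数
    return max(1, min(int(arrivals) + 1, preferred, max_batch))

# max_queue_delay_microseconds=100 时:
for qps in (100, 1000, 10000, 100000):
    print(f"{qps:>6} QPS -> batch ≈ {expected_batch(qps, 100, 16, 32)}")
```

可以看到在低负载下 100µs 的窗口几乎凑不出大batch;吞吐优先的场景通常需要调大 max_queue_delay_microseconds,代价是每个请求多排队一小段时间。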

14.6.3 Triton客户端代码

Python
import tritonclient.http as httpclient
import tritonclient.grpc as grpcclient
import numpy as np

class TritonClient:
    """Triton推理客户端"""

    def __init__(self, url="localhost:8000", model_name="resnet50",
                 protocol="http"):
        if protocol == "http":
            self.client = httpclient.InferenceServerClient(url=url)
        else:
            self.client = grpcclient.InferenceServerClient(url=url)

        self.model_name = model_name
        self.protocol = protocol

    def infer(self, input_data: np.ndarray):
        """发送推理请求"""
        if self.protocol == "http":
            inputs = [httpclient.InferInput(
                "input", input_data.shape, "FP32")]
            inputs[0].set_data_from_numpy(input_data)

            outputs = [httpclient.InferRequestedOutput("output")]

            result = self.client.infer(
                self.model_name, inputs, outputs=outputs)
            return result.as_numpy("output")
        else:
            inputs = [grpcclient.InferInput(
                "input", input_data.shape, "FP32")]
            inputs[0].set_data_from_numpy(input_data)

            outputs = [grpcclient.InferRequestedOutput("output")]

            result = self.client.infer(
                self.model_name, inputs, outputs=outputs)
            return result.as_numpy("output")

    def check_health(self):
        """检查服务状态"""
        return {
            "live": self.client.is_server_live(),
            "ready": self.client.is_server_ready(),
            "model_ready": self.client.is_model_ready(self.model_name),
        }

# 启动Triton Server
"""
docker run --gpus=all --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 \
    -v $(pwd)/model_repository:/models \
    nvcr.io/nvidia/tritonserver:24.01-py3 \
    tritonserver --model-repository=/models
"""

# 使用
client = TritonClient("localhost:8000", "resnet50")
print(client.check_health())

input_data = np.random.randn(1, 3, 224, 224).astype(np.float32)
output = client.infer(input_data)
print(f"Top-1: class {np.argmax(output)}")

14.6.4 Model Ensemble(模型流水线)

Protocol Buffer
# model_repository/detection_pipeline/config.pbtxt
name: "detection_pipeline"
platform: "ensemble"
max_batch_size: 1

ensemble_scheduling {
  step [
    {
      model_name: "preprocess"
      model_version: 1
      input_map {
        key: "raw_image"
        value: "RAW_IMAGE"
      }
      output_map {
        key: "preprocessed"
        value: "PREPROCESSED"
      }
    },
    {
      model_name: "yolov8"
      model_version: 1
      input_map {
        key: "input"
        value: "PREPROCESSED"
      }
      output_map {
        key: "output"
        value: "DETECTIONS"
      }
    },
    {
      model_name: "postprocess"
      model_version: 1
      input_map {
        key: "raw_detections"
        value: "DETECTIONS"
      }
      output_map {
        key: "final_boxes"
        value: "BOXES"
      }
    }
  ]
}

📝 面试考点:Triton的Dynamic Batching如何工作?Model Ensemble的适用场景?


14.7 移动端部署

14.7.1 NCNN部署(Android/iOS/嵌入式)

C++
// NCNN是腾讯开源的移动端推理引擎
// 特点: 纯C++、无第三方依赖、ARM NEON优化

#include "ncnn/net.h"
#include <opencv2/opencv.hpp>
#include <vector>

class NCNNDetector {
public:
    // 检测结果类型:需对外可见,且必须在 detect() 声明之前定义,
    // 否则作为返回类型使用时编译失败
    struct Detection {
        float x1, y1, x2, y2;
        float confidence;
        int class_id;
    };

    NCNNDetector(const std::string& param_path,
                 const std::string& bin_path) {
        net_.opt.use_vulkan_compute = false;  // 设为true可启用Vulkan GPU加速
        net_.opt.num_threads = 4;

        net_.load_param(param_path.c_str());
        net_.load_model(bin_path.c_str());
    }

    std::vector<Detection> detect(const cv::Mat& img,
                                   float conf_thres = 0.5f) {
        // 预处理
        ncnn::Mat input = ncnn::Mat::from_pixels_resize(
            img.data, ncnn::Mat::PIXEL_BGR2RGB,
            img.cols, img.rows, 640, 640);

        // 归一化
        const float mean_vals[3] = {0.f, 0.f, 0.f};
        const float norm_vals[3] = {1/255.f, 1/255.f, 1/255.f};
        input.substract_mean_normalize(mean_vals, norm_vals);

        // 推理
        ncnn::Extractor ex = net_.create_extractor();
        ex.input("input", input);

        ncnn::Mat output;
        ex.extract("output", output);

        // 后处理
        return postprocess(output, img.cols, img.rows, conf_thres);
    }

private:
    ncnn::Net net_;

    std::vector<Detection> postprocess(const ncnn::Mat& output,
                                        int img_w, int img_h,
                                        float conf_thres) {
        std::vector<Detection> results;
        // ... 解析output、坐标还原、NMS(此处省略)
        return results;
    }
};

14.7.2 模型转换为NCNN格式

Bash
# PyTorch → ONNX → NCNN
# 1. 导出ONNX
python export_onnx.py

# 2. ONNX简化
python -m onnxsim model.onnx model_sim.onnx

# 3. 转为NCNN格式
./onnx2ncnn model_sim.onnx model.param model.bin

# 4. 量化为INT8(减小模型体积+加速)
./ncnn2table model.param model.bin calibration_images/ model.table
./ncnn2int8 model.param model.bin model_int8.param model_int8.bin model.table

14.7.3 移动端推理框架对比

| 框架 | 公司 | 语言 | 量化 | GPU | 特色 |
|------|------|------|------|-----|------|
| NCNN | 腾讯 | C++ | INT8 | Vulkan | 无依赖、轻量 |
| MNN | 阿里 | C++ | INT8/MIX | OpenCL/Metal | 自适应调度 |
| TFLite | Google | C++ | INT8/FP16 | GPU Delegate | Android生态 |
| Core ML | Apple | Swift | FP16 | ANE(神经引擎) | iOS/macOS |
| ONNX Runtime Mobile | MS | C++ | INT8 | NNAPI | 跨平台 |

Text Only
选择建议:
├── Android: NCNN (性能最优) 或 TFLite (生态最好)
├── iOS: Core ML (苹果优化) 或 NCNN (跨平台)
├── 嵌入式Linux: NCNN / ONNX Runtime
└── Jetson: TensorRT (NVIDIA生态)

📝 面试考点:移动端部署和服务器部署的主要区别?选择推理框架的关键考虑因素?


14.8 性能优化最佳实践

14.8.1 数据预处理优化(NVIDIA DALI)

Python
from nvidia.dali import pipeline_def, fn, types
from nvidia.dali.plugin.pytorch import DALIGenericIterator

@pipeline_def(batch_size=32, num_threads=4, device_id=0)
def dali_pipeline():
    """DALI GPU加速预处理"""
    # 读取文件
    jpegs, labels = fn.readers.file(
        file_root="./images/", random_shuffle=True, name="Reader")

    # GPU解码(比CPU OpenCV快5-10x)
    images = fn.decoders.image(jpegs, device="mixed",
                                output_type=types.RGB)

    # GPU上做resize
    images = fn.resize(images, device="gpu",
                       resize_x=224, resize_y=224)

    # GPU上做归一化
    images = fn.crop_mirror_normalize(
        images,
        device="gpu",
        dtype=types.FLOAT,
        output_layout="CHW",
        mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
        std=[0.229 * 255, 0.224 * 255, 0.225 * 255],
    )

    return images, labels

# 使用DALI替代PyTorch DataLoader
pipe = dali_pipeline()
pipe.build()
# 迭代器需要知道数据集大小:通过reader_name关联上面命名的Reader
dali_iter = DALIGenericIterator([pipe], ["images", "labels"],
                                reader_name="Reader")

for batch in dali_iter:
    images = batch[0]["images"]  # 已在GPU上
    labels = batch[0]["labels"]
    # 直接送入模型,无需额外transfer

14.8.2 推理优化检查清单

Text Only
✅ 模型优化:
  □ ONNX Simplifier简化计算图
  □ TensorRT层融合(Conv+BN+ReLU)
  □ FP16/INT8精度降低
  □ 剪枝/蒸馏减小模型

✅ 系统优化:
  □ CUDA Streams实现预处理/推理/后处理流水线
  □ 输入固定大小(避免Dynamic Shape开销)
  □ GPU显存预分配(避免运行时malloc)
  □ 数据预处理放GPU上(DALI/OpenCV CUDA)

✅ 服务优化:
  □ Dynamic Batching收集请求
  □ 多模型实例(Multi-instance)
  □ 模型预热(Warm-up)消除首次延迟
  □ 连接池复用gRPC连接

✅ 监控:
  □ P50/P95/P99延迟
  □ 吞吐量(QPS)
  □ GPU利用率和显存
  □ 模型精度在线监测(数据漂移)
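
清单中的 P50/P95/P99 延迟可以用 numpy 直接统计(下面用合成延迟数据示意):

```python
import numpy as np

def latency_report(latencies_ms):
    """对一组延迟样本(ms)计算分位数、均值与等效单并发QPS"""
    lat = np.asarray(latencies_ms, dtype=np.float64)
    p50, p95, p99 = np.percentile(lat, [50, 95, 99])
    return {"p50": p50, "p95": p95, "p99": p99,
            "mean": lat.mean(), "qps": 1000.0 / lat.mean()}

# 合成数据:大部分请求约10ms,1%的长尾约50ms
rng = np.random.default_rng(0)
lats = np.concatenate([rng.normal(10, 1, 990), rng.normal(50, 5, 10)])
report = latency_report(lats)
print({k: round(v, 2) for k, v in report.items()})
```

长尾请求会被 P95/P99 捕捉,而在均值里被稀释——这就是监控分位延迟而非只看平均延迟的原因。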

14.8.3 推理Pipeline优化

Python
import threading
import queue
import time
import numpy as np

class InferencePipeline:
    """三级流水线: 预处理 | 推理 | 后处理"""

    def __init__(self, model, preprocess_fn, postprocess_fn,
                 batch_size=8, max_queue_size=100):
        self.model = model
        self.preprocess_fn = preprocess_fn
        self.postprocess_fn = postprocess_fn
        self.batch_size = batch_size

        # 三个队列连接三个阶段
        self.raw_queue = queue.Queue(max_queue_size)
        self.infer_queue = queue.Queue(max_queue_size)
        self.result_queue = queue.Queue(max_queue_size)

        self.running = False

    def start(self):
        self.running = True
        # 启动三个工作线程
        threading.Thread(target=self._preprocess_worker, daemon=True).start()
        threading.Thread(target=self._infer_worker, daemon=True).start()
        threading.Thread(target=self._postprocess_worker, daemon=True).start()

    def _preprocess_worker(self):
        """预处理线程(CPU)"""
        batch = []
        while self.running:
            try:  # try/except捕获异常
                item = self.raw_queue.get(timeout=0.01)
                batch.append(self.preprocess_fn(item))

                if len(batch) >= self.batch_size:
                    self.infer_queue.put(np.stack(batch))
                    batch = []
            except queue.Empty:
                if batch:  # 超时时发送不完整的batch
                    self.infer_queue.put(np.stack(batch))
                    batch = []

    def _infer_worker(self):
        """推理线程(GPU)"""
        while self.running:
            try:
                batch = self.infer_queue.get(timeout=0.1)
                output = self.model.infer(batch)
                self.result_queue.put(output)
            except queue.Empty:
                continue

    def _postprocess_worker(self):
        """后处理线程(CPU)"""
        while self.running:
            try:
                output = self.result_queue.get(timeout=0.1)
                results = self.postprocess_fn(output)
                # 回调或存储结果
            except queue.Empty:
                continue

    def submit(self, item):
        """提交推理请求"""
        self.raw_queue.put(item)

    def stop(self):
        self.running = False
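
上面流水线的工作方式可以用一个可独立运行的微型两级版本验证(示意代码:用哨兵值None关闭线程,并非原类的完整实现):

```python
import queue
import threading

def run_mini_pipeline(items):
    """两级流水线:stage1(模拟预处理) | stage2(模拟推理),None为关闭哨兵"""
    q1, q2, results = queue.Queue(), queue.Queue(), []

    def stage1():
        while (item := q1.get()) is not None:
            q2.put(item * 2)          # 模拟预处理
        q2.put(None)                  # 向下游传递哨兵

    def stage2():
        while (item := q2.get()) is not None:
            results.append(item + 1)  # 模拟推理

    t1 = threading.Thread(target=stage1)
    t2 = threading.Thread(target=stage2)
    t1.start(); t2.start()
    for x in items:
        q1.put(x)
    q1.put(None)                      # 关闭流水线
    t1.join(); t2.join()
    return results

print(run_mini_pipeline([1, 2, 3]))   # [3, 5, 7]
```

与超时轮询相比,哨兵值关闭更简洁;生产代码中两种方式常结合使用(轮询用于凑batch,哨兵用于优雅退出)。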

📝 面试考点:推理Pipeline三级流水线的设计原理?Dynamic Batching如何平衡延迟和吞吐?


14.9 面试高频题

Q1: PyTorch模型导出ONNX时需要注意什么?

:(1)opset_version选择:尽量用新版opset(如17),支持更多算子;(2)dynamic_axes设置动态维度,否则batch dimension被固定;(3)do_constant_folding=True做常量折叠优化;(4)复杂的控制流(if/for/while)可能导出失败,需改写为tensor操作;(5)自定义算子需注册ONNX symbolic;(6)导出后必须用onnx.checker.check_model验证+精度对比。
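
第(4)点"控制流改写为tensor操作"可用一个简化示例说明(此处用numpy的np.where示意,PyTorch中对应torch.where):

```python
import numpy as np

def clamp_branchy(x, limit):
    # 控制流写法:逐元素if。trace导出时分支会被固化,甚至导出失败
    return np.array([v if v < limit else limit for v in x])

def clamp_tensor(x, limit):
    # 张量化写法:np.where(PyTorch中对应torch.where),可安全trace导出
    return np.where(x < limit, x, limit)

x = np.array([0.5, 1.5, 2.5])
assert np.array_equal(clamp_branchy(x, 1.0), clamp_tensor(x, 1.0))
```

两种写法数值等价,但只有张量化写法在计算图中表现为单个算子,不依赖Python层的分支。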

Q2: TensorRT的主要优化技术有哪些?

:(1)层融合(Layer Fusion):Conv+BN+ReLU合并为一个kernel;(2)精度校准:FP32→FP16/INT8自动降精度;(3)Kernel自动调优(AutoTuning):为每一层测试多种kernel实现选最快的;(4)Tensor内存复用:不同层的中间tensor复用同一块显存;(5)多Stream推理:利用CUDA Stream并行执行独立的分支。

Q3: 如何选择推理引擎?

:决策树:(1)NVIDIA GPU → TensorRT(性能最强);(2)Intel CPU → OpenVINO(有CPU专门优化);(3)需要跨平台兼容 → ONNX Runtime(支持多EP);(4)移动端Android → NCNN(性能最优)或TFLite(生态最好);(5)iOS → Core ML(Apple ANE加速);(6)Jetson边缘 → TensorRT + DeepStream。

Q4: Triton Inference Server的Dynamic Batching如何工作?

:Triton的Dynamic Batching会暂时缓存到达的请求,在max_queue_delay_microseconds时间内收集尽可能多的请求组成大batch。好处是高负载时吞吐翻数倍(GPU并行),代价是低负载时增加少量延迟。preferred_batch_size 作为 hint 告诉Triton最优batch大小。延迟敏感场景设小延迟;吞吐优先场景设大延迟。

Q5: INT8量化部署后精度下降怎么办?

:排查步骤:(1)检查校准数据是否具有代表性(至少500-1000张覆盖各类场景);(2)换用Entropy校准替代MinMax(更鲁棒);(3)对精度敏感的层(如第一层Conv、最后的全连接)保持FP16,不量化;(4)在TensorRT中可用setLayerPrecision逐层指定精度(混合精度量化);(5)使用QAT(Quantization-Aware Training)微调模型感知量化误差。

Q6: 移动端部署与服务器部署的核心差异?

:(1)算力差异:手机GPU仅服务器GPU的1/50-1/100;(2)功耗约束:移动端必须考虑电池续航;(3)内存限制:手机通常2-8GB RAM,模型+推理占用需<200MB;(4)框架选择:移动端用NCNN/MNN/TFLite(轻量、不依赖大运行时),服务器用TensorRT;(5)模型裁剪:移动端通常用MobileNet/ShuffleNet等轻量模型+INT8量化。

Q7: 推理延迟和吞吐如何同时优化?

:(1)延迟优化:减小batch_size(=1最低延迟)、FP16/INT8加速、CUDA Graph减少kernel launch开销、预分配显存;(2)吞吐优化:增大batch_size(GPU利用率高)、Dynamic Batching、多模型实例、异步推理Pipeline;(3)两者矛盾时用三级流水线:预处理(CPU)→推理(GPU)→后处理(CPU)并行执行,稳态吞吐由最慢阶段决定(每请求间隔≈max(各阶段耗时)),单请求延迟仍为各阶段之和。
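
"流水线吞吐由最慢阶段决定"可以用简单算术验证(示意数字,假设三个阶段耗时分别为4/6/3ms):

```python
def throughput(stage_ms, pipelined: bool) -> float:
    """每秒可完成的请求数:串行为各阶段耗时之和的倒数,流水线为最慢阶段的倒数"""
    bottleneck = max(stage_ms) if pipelined else sum(stage_ms)
    return 1000.0 / bottleneck

stages = [4.0, 6.0, 3.0]   # 预处理 / 推理 / 后处理 耗时(ms)
print(f"串行:   {throughput(stages, False):.1f} req/s")   # 1000/13 ≈ 76.9
print(f"流水线: {throughput(stages, True):.1f} req/s")    # 1000/6 ≈ 166.7
```

流水线把吞吐从1/sum提升到1/max,但每个请求仍要依次经过三个阶段,延迟并不会降到max。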

Q8: 如何确保模型部署后的精度不降?

:(1)离线验证:ONNX/TensorRT输出与PyTorch逐元素对比,cosine similarity > 0.9999;(2)测试集评估:部署后在完整测试集上跑一遍指标(mAP/Accuracy/F1)与训练结果比较;(3)在线监控:抽样推理结果人工或自动校验,检测数据漂移(Data Drift);(4)金丝雀发布:先对5%流量用新模型,对比核心指标无退化再全量上线。

Q9: ONNX Runtime中TensorrtExecutionProvider和直接用TensorRT有什么区别?

:ORT的TensorRT EP是在ONNX Runtime内部调用TensorRT加速:优点是代码简单(一行切换EP)、自动fallback到CUDA EP处理不支持的op;缺点是Engine缓存管理不如直接TensorRT灵活、首次推理需要build Engine较慢。直接用TensorRT原生API有完全控制权(选kernel、逐层精度、Plugin开发),适合需要极致优化的场景。

Q10: 部署Pipeline的监控应该关注哪些指标?

:核心指标:(1)延迟:P50/P95/P99分别监控(P99比平均更能反映问题);(2)吞吐(QPS):当前请求率和GPU利用率;(3)错误率:推理失败、超时请求比例;(4)资源:GPU显存使用、GPU SM利用率、CPU使用率;(5)队列深度:Dynamic Batching队列是否堆积(预示需要扩容);(6)模型精度:在线采样验证,检测数据漂移和模型退化。


14.10 本章小结

核心知识点

| 概念 | 要点 |
|------|------|
| ONNX导出 | dynamic_axes, opset_version, 精度验证 |
| TensorRT | 层融合+精度校准+Kernel AutoTuning, Engine设备相关 |
| INT8量化 | 校准器(Entropy>MinMax), 混合精度保护敏感层 |
| Triton | Dynamic Batching, Model Ensemble, Multi-instance |
| 移动端 | NCNN(性能) vs TFLite(生态), INT8+轻量模型 |
| Pipeline | 预处理/推理/后处理三级流水线并行 |

部署决策路径

Text Only
目标吞吐 >1000 QPS?
├── Yes → TensorRT FP16/INT8 + Triton + 多GPU
└── No
    延迟 <10ms?
    ├── Yes → TensorRT + 预处理GPU加速 + 固定Batch
    └── No
        需要跨平台?
        ├── Yes → ONNX Runtime
        └── No → TensorRT(GPU) / OpenVINO(CPU)

恭喜完成第14章! 🎉