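Chapter 18: eBPF and Observability

Study time: 3.5 hours · Difficulty: Advanced · Importance: ⭐⭐⭐⭐⭐ (a core technology for next-generation cloud-native observability)

🎯 Learning Objectives

After completing this chapter, you will be able to:
- Understand how eBPF makes the kernel programmable and what keeps it safe
- Apply eBPF to observability in networking, performance, and security
- Understand the OpenTelemetry architecture and the OTLP protocol in depth
- Configure the OpenTelemetry Collector and set up auto-instrumentation
- Build a unified metrics/traces/logs observability solution on Kubernetes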

1. eBPF Overview

1.1 What Is eBPF?

eBPF (extended Berkeley Packet Filter) is a kernel technology that runs sandboxed programs inside the Linux kernel without changing kernel source code or loading kernel modules.

Text Only

┌──────────────────────────────────────────────┐
│                  User space                  │
│  ┌──────────┐   ┌──────────┐   ┌──────────┐  │
│  │ bpftrace │   │  Cilium  │   │ Tetragon │  │
│  └────┬─────┘   └────┬─────┘   └────┬─────┘  │
│       │              │              │        │
├───────┼──────────────┼──────────────┼────────┤
│       │        Kernel space         │        │
│  ┌────▼──────────────▼──────────────▼────┐   │
│  │      eBPF virtual machine (JIT)       │   │
│  │  ┌─────────┐  ┌──────┐  ┌──────────┐  │   │
│  │  │ Verifier│  │ Maps │  │ Helpers  │  │   │
│  │  └─────────┘  └──────┘  └──────────┘  │   │
│  └───────────────────────────────────────┘   │
│       │              │              │        │
│  ┌────▼────┐   ┌─────▼─────┐  ┌─────▼──────┐ │
│  │ kprobes │   │  TC/XDP   │  │ tracepoint │ │
│  │ uprobes │   │  cgroup   │  │ perf_event │ │
│  └─────────┘   └───────────┘  └────────────┘ │
└──────────────────────────────────────────────┘

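1.2 Core eBPF Concepts

Concept	Description
Verifier	Static analyzer that checks an eBPF program before it is loaded (no unbounded loops, no out-of-bounds memory access)
JIT Compiler	Compiles eBPF bytecode into native machine code, giving near-native performance
Maps	Data structures shared between kernel space and user space (hash map, array, ring buffer, and others)
Hook Points	Attachment points: kprobes, tracepoints, XDP, TC, cgroup, socket, and others
Helpers	Helper functions provided by the kernel (reading the clock, operating on maps, emitting data, and so on)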

1.3 The eBPF Security Sandbox

Text Only

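How eBPF programs are kept safe:
1. Static verification by the Verifier
   - Checks that every path terminates (unbounded loops are rejected)
   - Checks memory access bounds
   - Rejects reads of uninitialized variables
   - Limits program complexity (up to 1 million verified instructions on kernel 5.2+)

2. Privilege control
   - Loading requires CAP_BPF (kernel 5.8+) or CAP_SYS_ADMIN
   - Unprivileged eBPF is restricted to socket filtering and is disabled by default on many distributions

3. Resource limits
   - Map sizes are capped
   - Stack size is limited to 512 bytes
   - Tail calls can be chained at most 33 times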
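
The limits listed above can be pictured as a check that runs before a program is accepted. What follows is a toy model in Python, not the kernel verifier: `ProgramStats` and `check_limits` are hypothetical names, and the real verifier derives these figures by analysing bytecode paths rather than receiving them as input.

```python
from dataclasses import dataclass

# Limits quoted in the list above (toy constants, not read from a kernel).
MAX_INSNS = 1_000_000      # instruction budget
MAX_STACK_BYTES = 512      # stack limit
MAX_TAIL_CALLS = 33        # tail-call chain limit


@dataclass
class ProgramStats:
    """Hypothetical summary of an eBPF program, used only for illustration."""
    insns: int
    stack_bytes: int
    tail_call_depth: int
    has_unbounded_loop: bool


def check_limits(p: ProgramStats) -> list[str]:
    """Return the violated limits; an empty list means the toy check passes."""
    errors = []
    if p.has_unbounded_loop:
        errors.append("unbounded loop")
    if p.insns > MAX_INSNS:
        errors.append("too many instructions")
    if p.stack_bytes > MAX_STACK_BYTES:
        errors.append("stack too large")
    if p.tail_call_depth > MAX_TAIL_CALLS:
        errors.append("tail-call chain too deep")
    return errors


ok = check_limits(ProgramStats(insns=4000, stack_bytes=256,
                               tail_call_depth=2, has_unbounded_loop=False))
bad = check_limits(ProgramStats(insns=4000, stack_bytes=1024,
                                tail_call_depth=40, has_unbounded_loop=True))
print(ok)
print(bad)
```

The point of the sketch is the ordering: every check happens before anything runs, which is what separates this model from a kernel module that fails at runtime.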

2. eBPF in Observability

2.1 Network Traffic Analysis: Cilium

Cilium is an eBPF-based Kubernetes CNI and observability platform.

YAML

# Install Cilium (Helm)
# helm repo add cilium https://helm.cilium.io/
# helm install cilium cilium/cilium --namespace kube-system \
#   --set hubble.relay.enabled=true \
#   --set hubble.ui.enabled=true

# Hubble network observability: watch live traffic
# hubble observe --namespace default
# hubble observe --protocol http --verdict DROPPED

# CiliumNetworkPolicy example
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-frontend-to-api
spec:
  endpointSelector:
    matchLabels:
      app: api-server
  ingress:
    - fromEndpoints:
        - matchLabels:
            app: frontend
      toPorts:
        - ports:
            - port: "8080"
              protocol: TCP
          rules:
            http:
              - method: "GET"
                path: "/api/v1/.*"

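Observability capabilities provided by Hubble:
- L3/L4 network flows (TCP/UDP connections, packet-drop statistics)
- L7 protocol awareness (HTTP requests/responses, gRPC, Kafka, DNS)
- Service dependency topology (Service Map)
- NetworkPolicy auditing (allow/deny logs)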
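
To make the HTTP rule in the CiliumNetworkPolicy above concrete, here is a minimal Python sketch of the decision it describes. It is an illustration under two stated assumptions, not Cilium's implementation: the method is compared by equality (Cilium treats it as a regex too), and the path regex is assumed to be matched against the whole path. `http_allowed` is a hypothetical helper.

```python
import re

# Rule copied from the CiliumNetworkPolicy above.
RULE = {"method": "GET", "path": "/api/v1/.*"}


def http_allowed(method: str, path: str, rule: dict = RULE) -> bool:
    """Toy L7 check: the method must be equal and the path must fully match the regex."""
    return method == rule["method"] and re.fullmatch(rule["path"], path) is not None


print(http_allowed("GET", "/api/v1/orders"))    # allowed by the rule
print(http_allowed("POST", "/api/v1/orders"))   # wrong method
print(http_allowed("GET", "/healthz"))          # path outside /api/v1/
```

Requests that fail such a check are the ones that would show up with a DROPPED verdict in `hubble observe`.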

2.2 Performance Analysis: bpftrace & bcc

Bash

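# === bpftrace: one-liner performance analysis ===

# Trace read() system call latency (histogram in microseconds)
bpftrace -e 'tracepoint:syscalls:sys_enter_read { @start[tid] = nsecs; }
tracepoint:syscalls:sys_exit_read /@start[tid]/ {
  @us = hist((nsecs - @start[tid]) / 1000);
  delete(@start[tid]);
}'

# Trace process execution
bpftrace -e 'tracepoint:sched:sched_process_exec {
  printf("exec: %s (pid=%d)\n", comm, pid);
}'

# Count system calls per second, grouped by process name
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }
interval:s:1 { print(@); clear(@); }'

# === bcc tool collection ===
# TCP tracing
tcpconnlat            # trace TCP connection setup latency
tcplife               # trace TCP connection lifetimes
tcpretrans            # trace TCP retransmissions

# Disk I/O analysis
biolatency            # block device I/O latency histogram
biosnoop              # trace every block device I/O request
ext4slower 1          # trace ext4 operations slower than 1 ms

# CPU and scheduling
profile               # CPU profiling (sampled stack traces)
runqlat               # CPU run-queue (scheduler) latency
cpudist               # distribution of on-CPU time per scheduling slice

# Memory
memleak               # detect memory leaks
oomkill               # trace OOM kill events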
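
The `hist()` call in the first one-liner groups latencies into power-of-two buckets. The Python sketch below reproduces that bucketing in user space so the output format is easier to read; it mirrors the idea rather than bpftrace's code, and the sample latencies are made up.

```python
from collections import Counter


def log2_bucket(value: int) -> tuple[int, int]:
    """Return the [low, high) power-of-two bucket containing a non-negative value."""
    if value < 1:
        return (0, 1)
    low = 1 << (value.bit_length() - 1)
    return (low, low * 2)


def histogram(samples_us: list[int]) -> dict[tuple[int, int], int]:
    """Count samples per power-of-two bucket, sorted by bucket start."""
    counts = Counter(log2_bucket(v) for v in samples_us)
    return dict(sorted(counts.items()))


latencies_us = [0, 1, 3, 5, 6, 7, 120, 130, 900]   # made-up read() latencies
for (low, high), n in histogram(latencies_us).items():
    print(f"[{low}, {high})".ljust(14), "#" * n)
```

Power-of-two buckets keep the map small and of fixed size, which matters when the aggregation runs inside the kernel.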

2.3 Security Auditing: Falco & Tetragon

YAML

# === Falco: runtime security detection ===
# Falco monitors system calls through an eBPF probe and matches them against security rules

# Custom alert rule
- rule: Unexpected Outbound Connection
  desc: Detect unexpected outbound network connections from containers
  condition: >
    evt.type=connect and
    evt.dir=< and
    container and
    not (fd.sport in (80, 443, 53)) and
    not k8s.ns.name in (kube-system, monitoring)
  output: >
    Unexpected outbound connection
    (command=%proc.cmdline connection=%fd.name
     container=%container.name namespace=%k8s.ns.name
     pod=%k8s.pod.name image=%container.image.repository)
  priority: WARNING
  tags: [network, mitre_command_and_control]

# === Tetragon: eBPF-native security observability ===
# Tetragon is a security observability tool from the Cilium project
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: monitor-sensitive-files
spec:
  kprobes:
    - call: fd_install
      syscall: false
      args:
        - index: 0
          type: int
        - index: 1
          type: "file"
      selectors:
        - matchArgs:
            - index: 1
              operator: Prefix
              values:
                - /etc/shadow
                - /etc/passwd
                - /etc/kubernetes/
          matchActions:
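            - action: Sigkill  # kills the process immediately; /etc/passwd is read by many ordinary programs, so trial this without Sigkill first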
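
The `Prefix` operator in the TracingPolicy above is worth a second look, because prefix matching is broader than exact matching. The following Python sketch shows the semantics as assumed here (a plain string-prefix test); `matches_prefix` is a hypothetical helper, not a Tetragon API.

```python
# Prefixes copied from the TracingPolicy above.
SENSITIVE_PREFIXES = ("/etc/shadow", "/etc/passwd", "/etc/kubernetes/")


def matches_prefix(path: str, prefixes: tuple = SENSITIVE_PREFIXES) -> bool:
    """True when the path starts with any of the configured prefixes."""
    return path.startswith(prefixes)


for candidate in ("/etc/shadow", "/etc/passwd-", "/etc/kubernetes/admin.conf", "/etc/hosts"):
    print(candidate, matches_prefix(candidate))
```

Note that `/etc/passwd-` (a backup file) also matches, which is one reason to observe a policy's events before attaching an enforcement action to it.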

3. OpenTelemetry in Depth

3.1 The OTLP Protocol

OTLP (OpenTelemetry Protocol) is the OpenTelemetry project's standard protocol for transporting telemetry. It carries three signals: traces, metrics, and logs.

Text Only

OTLP transports:
┌──────────────────────┬──────────────────────┬──────────────────────┐
│ OTLP/gRPC            │ OTLP/HTTP            │ OTLP/HTTP (JSON)     │
│ default port 4317    │ default port 4318    │ default port 4318    │
│ Protobuf             │ Protobuf             │ JSON                 │
│ streaming, efficient │ widely compatible    │ readable, for debug  │
└──────────────────────┴──────────────────────┴──────────────────────┘

Data model (using a trace as the example):
Resource → ScopeSpans → Span
  ├── TraceId (16 bytes)
  ├── SpanId (8 bytes)
  ├── ParentSpanId
  ├── Name
  ├── Kind (CLIENT/SERVER/PRODUCER/CONSUMER/INTERNAL)
  ├── StartTime / EndTime
  ├── Attributes (key-value pairs)
  ├── Events (timestamped log entries)
  ├── Links (references to other traces)
  └── Status (OK/ERROR/UNSET)
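
The ID sizes in the data model above are easiest to remember next to the W3C `traceparent` header that carries them between services. Below is a minimal Python sketch, assuming the header layout `version-traceid-spanid-flags`; `SpanContext` here is a hypothetical class written for this example, not the OpenTelemetry SDK type of the same name.

```python
import secrets
from dataclasses import dataclass, field


@dataclass
class SpanContext:
    """Illustrative span identity: 16-byte trace ID, 8-byte span ID."""
    trace_id: bytes = field(default_factory=lambda: secrets.token_bytes(16))
    span_id: bytes = field(default_factory=lambda: secrets.token_bytes(8))
    sampled: bool = True

    def traceparent(self) -> str:
        """Render the IDs as a W3C traceparent header value."""
        flags = "01" if self.sampled else "00"
        return f"00-{self.trace_id.hex()}-{self.span_id.hex()}-{flags}"

    def child(self) -> "SpanContext":
        """A child span keeps the trace ID and sampling flag but gets a new span ID."""
        return SpanContext(trace_id=self.trace_id, sampled=self.sampled)


root = SpanContext()
child = root.child()
print(root.traceparent())
print(child.traceparent())
```

Both lines share the same 32-hex-digit trace ID and differ in the 16-hex-digit span ID, which is what lets a backend reassemble the spans into one trace.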

3.2 Collector Architecture (Receiver / Processor / Exporter)

The OpenTelemetry Collector is a pluggable data pipeline that receives, processes, and exports telemetry data.

YAML

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

  # Scrape metrics from Prometheus endpoints
  prometheus:
    config:
      scrape_configs:
        - job_name: "kubernetes-pods"
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
              action: keep
              regex: true

  # Collect Kubernetes events
  k8s_events:
    namespaces: [default, production]

processors:
  # Batching: fewer outbound network calls
  batch:
    send_batch_size: 8192
    timeout: 200ms

  # Memory limiter: guards against OOM
  memory_limiter:
    check_interval: 1s
    limit_mib: 1024
    spike_limit_mib: 256

  # Attribute processing
  attributes:
    actions:
      - key: environment
        value: production
        action: upsert
      - key: db.password
        action: delete  # strip sensitive data

  # Tail sampling (decides once the whole trace has been seen)
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: error-traces
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-traces
        type: latency
        latency: { threshold_ms: 1000 }
      - name: sample-rest
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }

exporters:
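  # Export to Prometheus
  prometheusremotewrite:
    endpoint: "http://prometheus:9090/api/v1/write"

  # Export to Jaeger
  otlp/jaeger:
    endpoint: "jaeger-collector:4317"
    tls:
      insecure: true

  # Export to Loki (logs). Newer Collector Contrib releases deprecate this exporter
  # in favor of sending OTLP to Loki directly, so check the version you deploy.
  loki:
    endpoint: "http://loki:3100/loki/api/v1/push"

  # Export to Grafana Tempo (traces)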
  otlp/tempo:
    endpoint: "tempo:4317"
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlp/tempo, otlp/jaeger]
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, attributes, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp, k8s_events]
      processors: [memory_limiter, attributes, batch]
      exporters: [loki]
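
The three `tail_sampling` policies in the configuration above combine as "keep if any policy says keep". The Python sketch below illustrates that decision for one buffered trace. It is a simplified model, not the Collector's code: `Span` and `keep_trace` are hypothetical names, latency is approximated by the longest span, and the probabilistic policy is replaced by a deterministic hash of the trace ID so the example is reproducible.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str
    status: str          # "OK", "ERROR" or "UNSET"
    duration_ms: float


def keep_trace(spans: list[Span], latency_ms: float = 1000, percentage: float = 10) -> bool:
    """Decide on a complete trace: errors and slow traces always, the rest by percentage."""
    if any(s.status == "ERROR" for s in spans):             # policy: error-traces
        return True
    if max(s.duration_ms for s in spans) >= latency_ms:     # policy: slow-traces
        return True
    # policy: sample-rest, using a stable hash instead of a random draw
    digest = hashlib.sha256(spans[0].trace_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % 10_000 < percentage * 100


failed = [Span("t1", "OK", 20), Span("t1", "ERROR", 35)]
slow = [Span("t2", "OK", 1500)]
print(keep_trace(failed), keep_trace(slow))
```

Because the decision needs every span of a trace, all spans must reach the same Collector instance and be held for `decision_wait`, which is where the memory cost of tail sampling comes from.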

3.3 Auto-instrumentation (Python / Java / Node.js)

Auto-instrumentation collects telemetry automatically, without changes to application code.

YAML

# === Kubernetes Operator approach (recommended) ===
# Install the OpenTelemetry Operator (it expects cert-manager to be present in the cluster)
# kubectl apply -f https://github.com/open-telemetry/opentelemetry-operator/releases/latest/download/opentelemetry-operator.yaml

# Create an Instrumentation resource
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: auto-instrumentation
  namespace: default
spec:
  exporter:
    endpoint: http://otel-collector:4317
  propagators:
    - tracecontext
    - baggage
    - b3
  sampler:
    type: parentbased_traceidratio
    argument: "0.25"  # 25% sampling ratio

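  python:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-python:latest
    env:
      # The Python agent exports OTLP over HTTP by default, so point it at port 4318
      - name: OTEL_EXPORTER_OTLP_ENDPOINT
        value: http://otel-collector:4318
      - name: OTEL_PYTHON_LOG_CORRELATION
        value: "true"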

  java:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:latest
    env:
      - name: OTEL_INSTRUMENTATION_JDBC_ENABLED
        value: "true"

  nodejs:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-nodejs:latest
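---  # YAML document separator
# Adding the annotation to the Pod template is enough to trigger injection
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-python-app
spec:
  selector:
    matchLabels:
      app: my-python-app
  template:
    metadata:
      labels:
        app: my-python-app
      annotations:
        instrumentation.opentelemetry.io/inject-python: "true"
    spec:
      containers:
        - name: app
          image: my-python-app:latest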

Python

# === Manual SDK approach (Python example) ===
from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.resources import Resource

# Configure the Resource
resource = Resource.create({
    "service.name": "order-service",
    "service.version": "1.2.0",
    "deployment.environment": "production",
})

# Configure traces
tracer_provider = TracerProvider(resource=resource)
tracer_provider.add_span_processor(
    # insecure=True: plaintext gRPC, for a Collector that listens without TLS
    BatchSpanProcessor(OTLPSpanExporter(endpoint="otel-collector:4317", insecure=True))
)
trace.set_tracer_provider(tracer_provider)

# Configure metrics
metric_reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="otel-collector:4317", insecure=True),
    export_interval_millis=15000,
)
meter_provider = MeterProvider(resource=resource, metric_readers=[metric_reader])
metrics.set_meter_provider(meter_provider)

# Usage
tracer = trace.get_tracer(__name__)
meter = metrics.get_meter(__name__)
request_counter = meter.create_counter("http.requests", description="HTTP request count")

@tracer.start_as_current_span("process_order")
def process_order(order_id: str):
    request_counter.add(1, {"method": "POST", "endpoint": "/orders"})
    span = trace.get_current_span()
    span.set_attribute("order.id", order_id)
    # Business logic...

3.4 Integration with Prometheus / Grafana / Jaeger

Text Only

Unified observability architecture:

                    ┌──────────────────┐
                    │   Application    │
                    │ (OTLP SDK/Agent) │
                    └────────┬─────────┘
                             │ OTLP
                    ┌────────▼─────────┐
                    │  OTel Collector  │
                    │  (Gateway Mode)  │
                    └──┬─────┬─────┬───┘
                       │     │     │
              ┌────────┘     │     └────────┐
              │              │              │
     ┌────────▼────┐   ┌─────▼─────┐  ┌─────▼──────┐
     │ Prometheus  │   │  Grafana  │  │   Jaeger   │
     │  (Metrics)  │   │   Tempo   │  │  (Traces)  │
     │             │   │ (Traces)  │  │            │
     └──────┬──────┘   └─────┬─────┘  └─────┬──────┘
            │                │              │
            └────────────────┼──────────────┘
                     ┌───────▼───────┐
                     │    Grafana    │
                     │   Dashboard   │
                     │ (unified view)│
                     └───────────────┘

4. Kubernetes Observability in Practice

4.1 A Unified Metrics / Traces / Logs Solution

YAML

# === Deploying a complete Kubernetes observability stack (Helm) ===

# 1. Deploy Prometheus + Grafana (kube-prometheus-stack)
#    The remote-write receiver must be enabled for the prometheusremotewrite exporter used below.
# helm install prometheus prometheus-community/kube-prometheus-stack \
#   --namespace monitoring --create-namespace \
#   --set grafana.enabled=true \
#   --set prometheus.prometheusSpec.enableRemoteWriteReceiver=true

# 2. Deploy Grafana Tempo (distributed tracing)
# helm install tempo grafana/tempo-distributed \
#   --namespace monitoring

# 3. Deploy Grafana Loki (log aggregation)
# helm install loki grafana/loki-stack \
#   --namespace monitoring \
#   --set promtail.enabled=true

# 4. Deploy the OpenTelemetry Collector (DaemonSet mode)
apiVersion: opentelemetry.io/v1beta1  # Kubernetes API version of the resource
kind: OpenTelemetryCollector  # resource type
metadata:
  name: otel-collector
  namespace: monitoring
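spec:  # desired state of the resource
  mode: daemonset
  env:
    # Node name used by the kubeletstats endpoint below
    - name: K8S_NODE_NAME
      valueFrom:
        fieldRef:
          fieldPath: spec.nodeName
  config: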
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
      kubeletstats:
        collection_interval: 30s
        auth_type: serviceAccount
        endpoint: "https://${env:K8S_NODE_NAME}:10250"
        insecure_skip_verify: true

    processors:
      k8sattributes:
        extract:
          metadata:
            - k8s.pod.name
            - k8s.namespace.name
            - k8s.deployment.name
            - k8s.node.name
        pod_association:
          - sources:
              - from: resource_attribute
                name: k8s.pod.ip
      batch:
        send_batch_size: 4096
        timeout: 5s

    exporters:
      prometheusremotewrite:
        endpoint: "http://prometheus-kube-prometheus-prometheus:9090/api/v1/write"
      otlp/tempo:
        endpoint: "tempo-distributor:4317"
        tls:
          insecure: true
      loki:
        endpoint: "http://loki:3100/loki/api/v1/push"

    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [k8sattributes, batch]
          exporters: [otlp/tempo]
        metrics:
          receivers: [otlp, kubeletstats]
          processors: [k8sattributes, batch]
          exporters: [prometheusremotewrite]
        logs:
          receivers: [otlp]
          processors: [k8sattributes, batch]
          exporters: [loki]
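
What the `k8sattributes` processor contributes in the pipelines above is a lookup: find the pod that sent the telemetry, then attach its metadata. The Python sketch below models the `k8s.pod.ip` association from the configuration. The pod index and all values in it are hypothetical (the real processor builds its index by watching the Kubernetes API), and keeping already-present attributes is a design choice of this sketch.

```python
# Hypothetical pod index; the real processor fills this from the Kubernetes API.
POD_INDEX = {
    "10.0.1.12": {
        "k8s.pod.name": "orders-7d9f6c-abcde",
        "k8s.namespace.name": "shop",
        "k8s.deployment.name": "orders",
        "k8s.node.name": "node-a",
    }
}


def enrich(resource: dict, index: dict = POD_INDEX) -> dict:
    """Add pod metadata looked up by k8s.pod.ip; attributes already present are kept."""
    metadata = index.get(resource.get("k8s.pod.ip"), {})
    return {**metadata, **resource}


known = enrich({"service.name": "order-service", "k8s.pod.ip": "10.0.1.12"})
unknown = enrich({"service.name": "batch-job", "k8s.pod.ip": "10.9.9.9"})
print(known["k8s.deployment.name"], unknown.get("k8s.deployment.name"))
```

Telemetry from an IP the index does not know passes through unchanged, so missing Kubernetes labels in the backend usually point at the association step or at the Collector's RBAC permissions.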

4.2 Unified Grafana Dashboards

Text Only

// Grafana data sources: linking Metrics → Traces → Logs
// Key configuration: links between data sources
{
  "datasources": [
    {
      "name": "Prometheus",
      "type": "prometheus",
      "jsonData": {
        "exemplarTraceIdDestinations": [
          { "name": "traceID", "datasourceUid": "tempo" }
        ]
      }
    },
    {
      "name": "Tempo",
      "type": "tempo",
      "jsonData": {
        "tracesToLogs": {
          "datasourceUid": "loki",
          "tags": ["k8s.pod.name", "k8s.namespace.name"]
        },
        "tracesToMetrics": {
          "datasourceUid": "prometheus"
        },
        "serviceMap": { "datasourceUid": "prometheus" }
      }
    },
    {
      "name": "Loki",
      "type": "loki",
      "jsonData": {
        "derivedFields": [
          {
            "matcherRegex": "traceID=(\\w+)",
            "name": "TraceID",
            "url": "${__value.raw}",
            "datasourceUid": "tempo"
          }
        ]
      }
    }
  ]
}
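
The Loki `derivedFields` entry above only works if log lines actually contain text that `matcherRegex` can capture. A quick way to check a log format is to run the same regex locally, as in this Python sketch; the log lines are made-up examples and `extract_trace_id` is a hypothetical helper.

```python
import re
from typing import Optional

# Same pattern as matcherRegex in the Loki data source above.
TRACE_ID_PATTERN = re.compile(r"traceID=(\w+)")


def extract_trace_id(line: str) -> Optional[str]:
    """Return the captured trace ID, or None when the line has no traceID= field."""
    match = TRACE_ID_PATTERN.search(line)
    return match.group(1) if match else None


print(extract_trace_id("level=info msg=created traceID=4bf92f3577b34da6a3ce929d0e0e4736"))
print(extract_trace_id("level=info msg=created trace_id=4bf92f3577b34da6"))
```

The second line yields `None` because the key is spelled `trace_id`; the regex and the application's log format have to agree on the field name.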

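5. Selected Interview Questions

Interview Question 1: What is eBPF, and what advantages does it have over traditional kernel modules?

Reference answer:
- eBPF lets users run sandboxed programs in kernel space without modifying the kernel or loading kernel modules
- Safety: the Verifier statically checks each program before it loads, which greatly reduces the risk of crashing the kernel compared with a kernel module
- Dynamic: programs can be loaded and unloaded at runtime, with no reboot
- Performance: JIT compilation gives near-native speed, well ahead of collecting the same data from user space
- Programmability: programs can attach to many hook points, such as kprobes, tracepoints, and XDP

Interview Question 2: What are the three main components of the OpenTelemetry Collector?

Reference answer:
- Receiver: accepts telemetry data (OTLP, Prometheus, Jaeger, Zipkin, and other formats)
- Processor: transforms data (batching, sampling, attribute manipulation, memory limiting, and so on)
- Exporter: sends data to backends (Prometheus, Jaeger, Tempo, Loki, and others)
- Pipelines combine the three, and each signal (traces/metrics/logs) can have its own pipeline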
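
The receiver, processor, exporter split can be sketched as plain function composition. The Python below is a conceptual model only, with hypothetical names throughout (`upsert_attribute`, `delete_attribute`, `run_pipeline`); the real Collector is written in Go and wires these stages from its YAML configuration.

```python
from typing import Callable

Processor = Callable[[list], list]


def upsert_attribute(key: str, value: str) -> Processor:
    """Processor factory: set key=value on every record."""
    def process(batch: list) -> list:
        return [{**record, key: value} for record in batch]
    return process


def delete_attribute(key: str) -> Processor:
    """Processor factory: drop a sensitive key from every record."""
    def process(batch: list) -> list:
        return [{k: v for k, v in record.items() if k != key} for record in batch]
    return process


def run_pipeline(received: list, processors: list, exporters: list) -> list:
    """Receiver output goes through the processors in order, then to every exporter."""
    batch = list(received)
    for process in processors:
        batch = process(batch)
    for export in exporters:
        export(batch)
    return batch


backend = []   # stands in for a tracing backend
received = [{"name": "GET /orders", "db.password": "s3cret"}]
run_pipeline(
    received,
    processors=[upsert_attribute("environment", "production"), delete_attribute("db.password")],
    exporters=[backend.extend],
)
print(backend)
```

Processor order matters in this model just as it does in the `processors:` list of a real pipeline, since each stage sees the output of the one before it.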

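Interview Question 3: Explain the difference between head sampling and tail sampling in OpenTelemetry

Property	Head sampling	Tail sampling
Decision point	When the trace starts	After the trace completes
Information needed	Trace ID (and parent context) only	The complete trace
Resource cost	Low	High (complete traces must be buffered)
Sophistication	Simple (probability or rate)	Richer (status code, latency, attributes)
Typical use	High-throughput environments	When errors and slow requests must be kept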
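
Head sampling, the left column of the table above, can decide from the trace ID alone. The Python sketch below models a parent-based trace-ID-ratio sampler like the `parentbased_traceidratio` setting used earlier in this chapter. It follows the general idea (compare the low 64 bits of the trace ID with a threshold) and is an assumption-laden illustration rather than the SDK's exact algorithm; `should_sample` is a hypothetical function.

```python
from typing import Optional


def should_sample(trace_id_hex: str, ratio: float, parent_sampled: Optional[bool] = None) -> bool:
    """Head-sampling decision made when a span starts.

    If a parent exists, follow its decision; otherwise sample when the low
    64 bits of the trace ID fall below ratio * 2**64.
    """
    if parent_sampled is not None:
        return parent_sampled
    threshold = int(ratio * (1 << 64))
    low_64_bits = int(trace_id_hex[16:], 16)
    return low_64_bits < threshold


low_id = "00000000000000000000000000000001"
high_id = "ffffffffffffffffffffffffffffffff"
print(should_sample(low_id, 0.25), should_sample(high_id, 0.25))
print(should_sample(high_id, 0.25, parent_sampled=True))
```

Every service that applies the same ratio to the same trace ID reaches the same decision, which is why head sampling needs no buffering and no coordination.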

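Interview Question 4: How do you correlate Metrics → Traces → Logs in Kubernetes?

Reference answer:
1. Metrics → Traces: Prometheus exemplars carry the trace ID, so clicking a data point jumps to the matching trace
2. Traces → Logs: Tempo's tracesToLogs setting links to Loki logs through the pod name and namespace
3. Logs → Traces: Loki derivedFields extracts the trace ID with a regex and links to Tempo
4. Across all three, OpenTelemetry injects trace_id and span_id into log records, which completes the correlation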
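
Step 4 of the answer above, putting the trace ID into every log line, can be sketched with the Python standard library alone. In a real service the OpenTelemetry logging instrumentation would supply the ID; here a `contextvars` variable stands in for the active span, and `TraceIdFilter`, `build_logger`, and the logger name are hypothetical.

```python
import contextvars
import io
import logging

# Stand-in for "the trace ID of the currently active span".
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every log record."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True


def build_logger(stream) -> logging.Logger:
    """Logger whose lines end with traceID=<id>, a format a log-side regex can capture."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s traceID=%(trace_id)s"))
    handler.addFilter(TraceIdFilter())
    logger = logging.getLogger("order-service-demo")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buffer = io.StringIO()
log = build_logger(buffer)
current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
log.info("order created")
print(buffer.getvalue().strip())
```

With lines in this shape, a regex such as `traceID=(\w+)` on the log side can turn each line into a link to the corresponding trace.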

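6. Checklist

eBPF Basics

Understand how the eBPF virtual machine runs a program (Verifier → JIT → execution)
Know the eBPF map types and how maps connect user space and kernel space
Be able to write simple tracing scripts with bpftrace
Understand the limits and guarantees of the eBPF sandbox

Observability Tools

Know the network observability capabilities of Cilium + Hubble
Be able to use the bcc tools for performance analysis (tcpconnlat, biolatency, and others)
Understand how Falco and Tetragon perform security auditing

OpenTelemetry

Understand the OTLP protocol and its three transports
Be able to configure the OpenTelemetry Collector (Receiver/Processor/Exporter)
Know both the Operator approach to auto-instrumentation and the manual SDK approach
Understand the difference between head sampling and tail sampling and when to use each

Kubernetes Observability

Be able to deploy a unified Prometheus + Tempo + Loki + OTel Collector stack
Configure Grafana data source links to jump between Metrics ↔ Traces ↔ Logs
Understand how the k8sattributes processor injects metadata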
