05-内存模型与数据传输¶

重要性：⭐⭐⭐⭐⭐ 实用度：⭐⭐⭐⭐⭐ 学习时间：2天 必须掌握：是

为什么学这一章？¶

GPU程序的性能瓶颈往往不是计算而是内存访问。理解GPU的内存层次结构、合并访问模式和数据传输优化，是将CUDA程序性能提升10-100倍的关键。

学完这一章，你将能够： - ✅ 理解GPU的完整内存层次（寄存器→共享内存→全局内存） - ✅ 掌握合并访问（Coalesced Access）原理 - ✅ 使用共享内存优化访问模式 - ✅ 优化Host与Device之间的数据传输

📖 核心概念¶

1. GPU内存层次结构¶

Text Only

┌─────────────────────────────────────────────────────────────┐
│              GPU 内存层次结构                                 │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  速度快                                                      │
│  ↑                                                          │
│  │  寄存器（Registers）                                     │
│  │  ├── 每个线程私有，速度最快                               │
│  │  ├── 数量有限（每SM 65536个32位寄存器）                   │
│  │  └── 延迟：~1个时钟周期                                  │
│  │                                                          │
│  │  共享内存（Shared Memory）                               │
│  │  ├── Block内所有线程共享                                 │
│  │  ├── 用户管理的缓存（48KB-164KB/SM）                    │
│  │  ├── 延迟：~5-30个时钟周期                              │
│  │  └── 需要__syncthreads()保证一致性                      │
│  │                                                          │
│  │  L1缓存 / 纹理缓存                                     │
│  │  ├── 硬件管理，对程序员透明                              │
│  │  └── 延迟：~30-50个时钟周期                             │
│  │                                                          │
│  │  L2缓存                                                  │
│  │  ├── 所有SM共享                                          │
│  │  └── 延迟：~200个时钟周期                               │
│  │                                                          │
│  │  全局内存（Global Memory / DRAM）                        │
│  │  ├── 所有线程可访问，容量大（8-80GB）                   │
│  │  ├── 延迟：~400-600个时钟周期                           │
│  │  └── 带宽：~900GB/s（A100）                             │
│  ↓                                                          │
│  速度慢                                                      │
│                                                             │
│  主机内存（Host Memory / CPU DRAM）                         │
│  ├── 通过PCIe总线传输：~32GB/s（PCIe 4.0 x16）            │
│  └── 通过NVLink传输：~600GB/s                              │
│                                                             │
└─────────────────────────────────────────────────────────────┘

2. 内存类型详解¶

Text Only

┌─────────────────────────────────────────────────────────────┐
│  内存类型    作用域      生命周期    声明方式                  │
├─────────────────────────────────────────────────────────────┤
│  寄存器      线程        线程        自动变量                 │
│  局部内存    线程        线程        大数组/溢出              │
│  共享内存    Block       Block       __shared__              │
│  全局内存    所有线程    应用        cudaMalloc / __device__  │
│  常量内存    所有线程    应用        __constant__（64KB）    │
│  纹理内存    所有线程    应用        texture对象（有缓存）    │
└─────────────────────────────────────────────────────────────┘

3. 合并访问（Coalesced Access）¶

Text Only

┌─────────────────────────────────────────────────────────────┐
│              合并访问 vs 非合并访问                           │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  合并访问（高效）：                                          │
│  Warp中32个线程访问连续的32×4=128字节                       │
│  → 一次128字节事务完成                                      │
│                                                             │
│  线程0→data[0], 线程1→data[1], ..., 线程31→data[31]       │
│  ┌──┬──┬──┬──┬──┬──┬──┬──┬───────────────────────┐       │
│  │ 0│ 1│ 2│ 3│ 4│ 5│ 6│ 7│ ...                   │ 128B  │
│  └──┴──┴──┴──┴──┴──┴──┴──┴───────────────────────┘       │
│  ↑  ↑  ↑  ↑  ↑  ↑  ↑  ↑                                  │
│  t0 t1 t2 t3 t4 t5 t6 t7  → 一次内存事务                  │
│                                                             │
│  非合并访问（低效）：                                        │
│  线程0→data[0], 线程1→data[32], 线程2→data[64]...         │
│  → 每个线程访问不同的缓存行，触发32次内存事务！             │
│                                                             │
│  带宽利用率：                                                │
│  合并访问：~100%                                            │
│  跨步=32：~3%（浪费97%带宽）                               │
│                                                             │
└─────────────────────────────────────────────────────────────┘

CUDA

// coalesced_access.cu - 合并访问对比
#include <stdio.h>
#include <cuda_runtime.h>

#define N (1 << 22)  // 4M元素

// 合并访问：相邻线程访问相邻元素
__global__ void coalesced_read(float *input, float *output, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        output[idx] = input[idx] * 2.0f;  // 线程idx访问data[idx]
    }
}

// 非合并访问：跨步访问
__global__ void strided_read(float *input, float *output, int n, int stride) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int real_idx = (idx * stride) % n;  // 跨步访问
    if (idx < n) {
        output[idx] = input[real_idx] * 2.0f;
    }
}

int main() {
    size_t size = N * sizeof(float);
    float *d_input, *d_output;
    cudaMalloc(&d_input, size);
    cudaMalloc(&d_output, size);

    // 初始化
    float *h_input = (float *)malloc(size);
    for (int i = 0; i < N; i++) h_input[i] = (float)i;
    cudaMemcpy(d_input, h_input, size, cudaMemcpyHostToDevice);

    int blockSize = 256;
    int gridSize = (N + blockSize - 1) / blockSize;

    // CUDA事件计时
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms;

    // 合并访问
    cudaEventRecord(start);
    for (int i = 0; i < 100; i++)
        coalesced_read<<<gridSize, blockSize>>>(d_input, d_output, N);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("合并访问:   %.2f ms (100次平均 %.3f ms)\n", ms, ms / 100);

    // 跨步访问（stride=32）
    cudaEventRecord(start);
    for (int i = 0; i < 100; i++)
        strided_read<<<gridSize, blockSize>>>(d_input, d_output, N, 32);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("跨步访问(32): %.2f ms (100次平均 %.3f ms)\n", ms, ms / 100);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    free(h_input);
    cudaFree(d_input);
    cudaFree(d_output);
    return 0;
}

Bash

nvcc coalesced_access.cu -o coalesced && ./coalesced

4. 共享内存优化¶

CUDA

// shared_memory_opt.cu - 使用共享内存优化矩阵转置
#include <stdio.h>
#include <cuda_runtime.h>

#define TILE_DIM 32
#define BLOCK_ROWS 8

// 朴素转置（全局内存非合并写入）
__global__ void transpose_naive(float *input, float *output,
                                 int width, int height) {
    int xIdx = blockIdx.x * TILE_DIM + threadIdx.x;
    int yIdx = blockIdx.y * TILE_DIM + threadIdx.y;

    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS) {
        if (xIdx < width && (yIdx + j) < height) {
            // 读取是合并的，但写入是跨步的（非合并）
            output[xIdx * height + (yIdx + j)] = input[(yIdx + j) * width + xIdx];
        }
    }
}

// 使用共享内存的转置（读写都合并）
__global__ void transpose_shared(float *input, float *output,
                                  int width, int height) {
    // 多一列避免Bank Conflict
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];

    int xIdx = blockIdx.x * TILE_DIM + threadIdx.x;
    int yIdx = blockIdx.y * TILE_DIM + threadIdx.y;

    // 合并读取：从行主序的input读取一个tile
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS) {
        if (xIdx < width && (yIdx + j) < height) {
            tile[threadIdx.y + j][threadIdx.x] = input[(yIdx + j) * width + xIdx];
        }
    }
    __syncthreads();

    // 转换索引
    xIdx = blockIdx.y * TILE_DIM + threadIdx.x;
    yIdx = blockIdx.x * TILE_DIM + threadIdx.y;

    // 合并写入：从共享内存读取转置后的数据
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS) {
        if (xIdx < height && (yIdx + j) < width) {
            output[(yIdx + j) * height + xIdx] = tile[threadIdx.x][threadIdx.y + j];
        }
    }
}

int main() {
    int width = 1024, height = 1024;
    size_t size = width * height * sizeof(float);

    float *h_input = (float *)malloc(size);
    float *h_output = (float *)malloc(size);
    for (int i = 0; i < width * height; i++) h_input[i] = (float)i;

    float *d_input, *d_output;
    cudaMalloc(&d_input, size);
    cudaMalloc(&d_output, size);
    cudaMemcpy(d_input, h_input, size, cudaMemcpyHostToDevice);

    dim3 grid(width / TILE_DIM, height / TILE_DIM);
    dim3 block(TILE_DIM, BLOCK_ROWS);

    // 计时
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms;

    // 朴素版本
    cudaEventRecord(start);
    for (int i = 0; i < 100; i++)
        transpose_naive<<<grid, block>>>(d_input, d_output, width, height);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("朴素转置:     %.2f ms/次\n", ms / 100);

    // 共享内存版本
    cudaEventRecord(start);
    for (int i = 0; i < 100; i++)
        transpose_shared<<<grid, block>>>(d_input, d_output, width, height);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("共享内存转置: %.2f ms/次\n", ms / 100);

    // 验证
    cudaMemcpy(h_output, d_output, size, cudaMemcpyDeviceToHost);
    int correct = 1;
    for (int y = 0; y < height && correct; y++)
        for (int x = 0; x < width && correct; x++)
            if (h_output[y * width + x] != h_input[x * height + y])
                correct = 0;
    printf("验证: %s\n", correct ? "正确" : "错误");

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    free(h_input); free(h_output);
    cudaFree(d_input); cudaFree(d_output);
    return 0;
}

5. 共享内存Bank Conflict¶

Text Only

┌─────────────────────────────────────────────────────────────┐
│              Bank Conflict 原理                              │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  共享内存被分为32个Bank，每个Bank宽4字节：                   │
│                                                             │
│  Bank:   0    1    2    3   ...   31                        │
│  地址: [0-3][4-7][8-11][12-15] [124-127]                   │
│        [128-131][132-135]...    [252-255]                   │
│                                                             │
│  无冲突：32个线程访问32个不同Bank                            │
│  → 一个周期完成                                             │
│                                                             │
│  Bank Conflict：多个线程访问同一Bank的不同地址               │
│  → 串行化，n-way冲突→n个周期                               │
│                                                             │
│  特殊情况：多个线程访问同一Bank的同一地址                    │
│  → 广播，无冲突                                             │
│                                                             │
│  避免方法：                                                  │
│  • 对2D共享内存数组加padding：                              │
│    __shared__ float tile[32][32+1]; // +1错开Bank           │
│  • 调整访问步长避免2的幂次                                   │
│                                                             │
└─────────────────────────────────────────────────────────────┘

6. Host-Device数据传输优化¶

CUDA

// transfer_optimization.cu - 数据传输优化技术
#include <stdio.h>
#include <cuda_runtime.h>

#define SIZE (64 * 1024 * 1024)  // 64MB

__global__ void dummy_kernel(float *data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] *= 2.0f;
}

int main() {
    float *h_data_pageable, *h_data_pinned;
    float *d_data;
    int n = SIZE / sizeof(float);

    // 普通内存（pageable）
    h_data_pageable = (float *)malloc(SIZE);

    // 锁页内存（pinned）
    cudaMallocHost(&h_data_pinned, SIZE);

    cudaMalloc(&d_data, SIZE);

    // 初始化
    for (int i = 0; i < n; i++) {
        h_data_pageable[i] = (float)i;
        h_data_pinned[i] = (float)i;
    }

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms;

    // 测试pageable内存传输
    cudaEventRecord(start);
    cudaMemcpy(d_data, h_data_pageable, SIZE, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("Pageable H→D: %.2f ms (%.2f GB/s)\n",
           ms, (float)SIZE / ms / 1e6);

    // 测试pinned内存传输
    cudaEventRecord(start);
    cudaMemcpy(d_data, h_data_pinned, SIZE, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("Pinned H→D:   %.2f ms (%.2f GB/s)\n",
           ms, (float)SIZE / ms / 1e6);

    // 异步传输 + 计算重叠（使用CUDA Stream）
    cudaStream_t stream1, stream2;
    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);

    int half = n / 2;
    size_t halfSize = half * sizeof(float);

    cudaEventRecord(start);
    // Stream 1：传输前半 + 计算前半
    cudaMemcpyAsync(d_data, h_data_pinned, halfSize,
                    cudaMemcpyHostToDevice, stream1);
    dummy_kernel<<<(half + 255) / 256, 256, 0, stream1>>>(d_data, half);

    // Stream 2：传输后半 + 计算后半（与Stream1重叠）
    cudaMemcpyAsync(d_data + half, h_data_pinned + half, halfSize,
                    cudaMemcpyHostToDevice, stream2);
    dummy_kernel<<<(half + 255) / 256, 256, 0, stream2>>>(d_data + half, half);

    cudaStreamSynchronize(stream1);
    cudaStreamSynchronize(stream2);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("异步流水线:   %.2f ms\n", ms);

    // 清理
    cudaStreamDestroy(stream1);
    cudaStreamDestroy(stream2);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    free(h_data_pageable);
    cudaFreeHost(h_data_pinned);
    cudaFree(d_data);
    return 0;
}

Bash

nvcc transfer_optimization.cu -o transfer_opt && ./transfer_opt  # &&前一个成功才执行后一个；||前一个失败才执行

7. 统一内存（Unified Memory）¶

CUDA

// unified_memory.cu - CUDA统一内存
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void square(float *data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        data[idx] = data[idx] * data[idx];
    }
}

int main() {
    int n = 1024;
    float *data;

    // 分配统一内存：CPU和GPU都可以直接访问
    cudaMallocManaged(&data, n * sizeof(float));

    // CPU初始化（直接访问，不需要cudaMemcpy）
    for (int i = 0; i < n; i++) {
        data[i] = (float)i;
    }

    // GPU计算
    square<<<(n + 255) / 256, 256>>>(data, n);
    cudaDeviceSynchronize();  // 必须同步

    // CPU读取结果（直接访问，不需要cudaMemcpy）
    printf("data[0] = %.0f (期望 0)\n", data[0]);
    printf("data[10] = %.0f (期望 100)\n", data[10]);
    printf("data[100] = %.0f (期望 10000)\n", data[100]);

    cudaFree(data);  // 注意：用cudaFree释放，不是free
    return 0;
}

Text Only

┌─────────────────────────────────────────────────────────────┐
│              统一内存 vs 显式管理                             │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  统一内存（cudaMallocManaged）：                             │
│  ✅ 编程简单，不需要手动拷贝                                 │
│  ✅ 自动处理页面迁移                                        │
│  ❌ 性能可能不如显式管理（页面错误开销）                     │
│  ❌ 首次访问时可能触发缺页                                   │
│  适合：快速原型、内存分配不规则的场景                        │
│                                                             │
│  显式管理（cudaMalloc + cudaMemcpy）：                      │
│  ✅ 性能可预测，可精确控制传输时机                           │
│  ✅ 可使用pinned内存和异步流水线                            │
│  ❌ 编程复杂，需要管理两套指针                               │
│  适合：性能关键型应用、生产环境                              │
│                                                             │
└─────────────────────────────────────────────────────────────┘

💡 面试常见问题¶

Q1：什么是合并访问（Coalesced Access）？为什么重要？¶

答：合并访问指同一Warp中的线程访问连续的内存地址，GPU可以合并为一次或少量内存事务。重要性：全局内存延迟高（~400个周期），一次128字节事务读取32个float，如果32个线程各自触发独立事务，带宽利用率仅3%。保持合并访问是GPU性能优化的第一要务。

Q2：共享内存的Bank Conflict是什么？如何避免？¶

答：共享内存分32个Bank，当同一Warp中多个线程访问同一Bank的不同地址时产生冲突，访问串行化。避免方法：①给2D数组加padding（如[32][32+1]错开Bank）；②调整访问模式使线程映射到不同Bank；③多线程访问同一地址反而不冲突（广播）。

Q3：Pinned Memory相比普通内存传输为什么更快？¶

答：普通内存（pageable）传输时，CUDA驱动必须先拷贝到一块临时的锁页缓冲区再DMA传输到GPU。Pinned Memory（锁页内存）直接DMA传输，减少一次拷贝。此外，Pinned Memory支持异步传输（cudaMemcpyAsync），可以实现计算与传输重叠。

Q4：CUDA Stream有什么作用？如何实现传输和计算重叠？¶

答：Stream是GPU上按顺序执行的命令队列。不同Stream的操作可以并发执行。实现重叠：①使用Pinned Memory；②划分数据为多块；③为每块创建一个Stream；④在各Stream中交替执行传输和计算。GPU的DMA引擎和计算引擎独立，可以同时工作。

Q5：什么时候用统一内存（Unified Memory）？什么时候用显式管理？¶

答：统一内存适合：快速原型开发、不规则数据结构（链表、图）、数据量不确定的场景。显式管理适合：性能关键应用、数据传输模式可预测、需要异步流水线优化的场景。在AI训练等大规模计算中，通常使用显式管理以获得最佳性能。

📝 本章小结¶

Text Only

┌─────────────────────────────────────────────┐
│              本章核心知识点                    │
├─────────────────────────────────────────────┤
│                                             │
│  1. GPU内存层次：寄存器→共享→L2→全局      │
│  2. 合并访问是性能优化的第一准则            │
│  3. 共享内存是用户管理的高速缓存            │
│  4. Bank Conflict通过padding避免            │
│  5. Pinned Memory + CUDA Stream流水线       │
│  6. 统一内存简化编程但性能需权衡            │
│                                             │
└─────────────────────────────────────────────┘

回到目录：README.md