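Chapter 17: Infrastructure as Code (IaC)
IaC is the practice of defining and managing infrastructure through code, automating the creation, configuration, and change of infrastructure in either a declarative or an imperative style.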
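Contents
- 1. IaC Overview
- 2. Terraform Basics
- 3. Advanced Terraform
- 4. State Management
- 5. Module Design
- 6. Pulumi: Comparison and Practice
- 7. IaC Best Practices
- 8. Selected Interview Questions
- 9. Recommended Resources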
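1. IaC Overview
1.1 Why IaC
| Manual management | IaC management |
|---|---|
| Configure by clicking through a console | Infrastructure defined in code |
| Hard to repeat | Fully repeatable |
| No version history | Version-controlled in Git |
| Error-prone | Automation reduces human error |
| Environments drift apart | Consistent environments |
| Hard to audit | Complete audit trail |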
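1.2 The IaC Tool Landscape
Text Only
Declarative provisioning tools:
┌─────────────────────┬─────────────────────────┬─────────────────────┐
│ Terraform           │ Pulumi                  │ CloudFormation      │
│ (HCL)               │ (programming languages) │ (AWS only)          │
├─────────────────────┼─────────────────────────┼─────────────────────┤
│ Crossplane          │ Bicep                   │ CDK                 │
│ (Kubernetes-native) │ (Azure only)            │ (AWS, programmatic) │
└─────────────────────┴─────────────────────────┴─────────────────────┘
Configuration management tools:
┌─────────────────────┬─────────────────────────┬─────────────────────┐
│ Ansible             │ Chef                    │ Puppet              │
│ (agentless)         │ (Ruby DSL)              │ (declarative)       │
└─────────────────────┴─────────────────────────┴─────────────────────┘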
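1.3 Declarative vs. Imperative
Text Only
Declarative (Terraform/CloudFormation):
  "I want 3 EC2 instances"
  → the tool works out how many already exist and creates only the missing ones
Imperative (scripts, e.g. a shell script calling the cloud CLI):
  "Create an EC2 instance × 3"
  → running it a second time leaves you with 6 instances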
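The difference can be sketched in a few lines of Python. This is only an illustration of the idea, not how any real tool is implemented; the function names and the fake instance IDs are made up for the example.

```python
def imperative_create(existing: list[str], count: int) -> list[str]:
    """Imperative style: always create `count` new instances."""
    start = len(existing)
    return existing + [f"i-{start + n}" for n in range(count)]


def declarative_reconcile(existing: list[str], desired: int) -> list[str]:
    """Declarative style: converge on exactly `desired` instances."""
    kept = existing[:desired]                 # remove any surplus
    missing = max(desired - len(existing), 0)  # create only what is missing
    start = len(existing)
    return kept + [f"i-{start + n}" for n in range(missing)]


fleet = imperative_create([], 3)
fleet = imperative_create(fleet, 3)   # second run adds three more
print(len(fleet))  # 6

fleet = declarative_reconcile([], 3)
fleet = declarative_reconcile(fleet, 3)  # second run changes nothing
print(len(fleet))  # 3
```

Running the declarative function any number of times gives the same result, which is the property that makes repeated applies safe.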
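2. Terraform Basics
2.1 Core Terraform Concepts
Text Only
┌──────────────────────────────────────────────────┐
│             Terraform execution flow             │
│                                                  │
│  .tf files ──→ terraform plan    ──→ plan        │
│                       │                          │
│                terraform apply   ──→ execute     │
│                       │                          │
│                terraform.tfstate ──→ state file  │
│                       │                          │
│                terraform destroy ──→ tear down   │
└──────────────────────────────────────────────────┘
Three core concepts:
- Provider: a plugin that talks to a cloud platform or service (AWS/Azure/GCP/Kubernetes)
- Resource: an infrastructure object to manage (EC2/VPC/RDS)
- State: Terraform's record of the resources it manages and their last known attributes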
2.2 HCL Syntax Basics
Terraform
# Configure Terraform and the required providers
terraform {
  required_version = ">= 1.5"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}
# Configure the provider
provider "aws" {
  region = var.aws_region
}
# Variable definitions
variable "aws_region" {
  description = "AWS region"
  type        = string
  default     = "us-east-1"
}
variable "instance_type" {
  description = "EC2 instance type"
  type        = string
  default     = "t3.micro"
  validation {
    condition     = contains(["t3.micro", "t3.small", "t3.medium"], var.instance_type)
    error_message = "Instance type must be one of t3.micro, t3.small, or t3.medium."
  }
}
variable "environment" {
  description = "Environment name"
  type        = string
}
2.3 Defining Resources
Terraform
# VPC
resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
  tags = {
    Name        = "${var.environment}-vpc"
    Environment = var.environment
  }
}
# Subnets (this basic example uses count; for production, prefer for_each, see section 3.1)
resource "aws_subnet" "public" {
  count             = 2
  vpc_id            = aws_vpc.main.id
  cidr_block        = "10.0.${count.index + 1}.0/24"
  availability_zone = data.aws_availability_zones.available.names[count.index]
  # Without this, instances get no public IP and the instance_ip output is empty
  map_public_ip_on_launch = true
  tags = {
    Name = "${var.environment}-public-${count.index + 1}"
  }
}
# EC2 instance
resource "aws_instance" "web" {
  ami           = data.aws_ami.amazon_linux.id
  instance_type = var.instance_type
  subnet_id     = aws_subnet.public[0].id
  tags = {
    Name = "${var.environment}-web-server"
  }
  lifecycle {
    # On replacement, create the new instance before destroying the old one
    create_before_destroy = true
  }
}
# Outputs
output "instance_ip" {
  description = "Public IP of the web server"
  value       = aws_instance.web.public_ip
}
output "vpc_id" {
  description = "VPC ID"
  value       = aws_vpc.main.id
}
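As a quick sanity check on the subnet addressing above, the CIDR expression can be reproduced outside Terraform. The sketch below uses Python's standard `ipaddress` module to confirm that the two generated subnets fall inside the VPC range and do not overlap; it mirrors the `count.index + 1` arithmetic and is not part of the Terraform configuration.

```python
import ipaddress

vpc = ipaddress.ip_network("10.0.0.0/16")

# Mirrors cidr_block = "10.0.${count.index + 1}.0/24" with count = 2
subnets = [ipaddress.ip_network(f"10.0.{index + 1}.0/24") for index in range(2)]

for subnet in subnets:
    # Every subnet must sit inside the VPC range
    assert subnet.subnet_of(vpc)

# The subnets must not overlap each other
assert not subnets[0].overlaps(subnets[1])

print([str(s) for s in subnets])  # ['10.0.1.0/24', '10.0.2.0/24']
```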
2.4 Data Sources
Terraform
# Look up the most recent Amazon Linux 2 AMI
data "aws_ami" "amazon_linux" {
  most_recent = true
  owners      = ["amazon"]
  filter {
    name   = "name"
    values = ["amzn2-ami-hvm-*-x86_64-gp2"]
  }
}
# Look up the availability zones
data "aws_availability_zones" "available" {
  state = "available"
}
# Look up the current account
data "aws_caller_identity" "current" {}
3. Advanced Terraform
3.1 Expressions and Functions
Terraform
# Conditional expression
resource "aws_instance" "web" {
  # ... other arguments omitted
  instance_type = var.environment == "production" ? "t3.large" : "t3.micro"
}
# for_each (preferred over count; avoids resources being recreated when indexes shift)
variable "services" {
  type = map(object({
    port     = number
    replicas = number
  }))
  default = {
    "frontend" = { port = 80, replicas = 2 }
    "backend"  = { port = 8080, replicas = 3 }
    "worker"   = { port = 9090, replicas = 1 }
  }
}
resource "aws_ecs_service" "services" {
  for_each      = var.services
  name          = each.key
  desired_count = each.value.replicas
  # ...
}
# Commonly used built-in functions
locals {
  common_tags = {
    Environment = var.environment
    ManagedBy   = "terraform"
    Project     = var.project_name
  }
  # Merge maps
  all_tags = merge(local.common_tags, var.extra_tags)
  # String handling
  bucket_name = lower("${var.project_name}-${var.environment}-assets")
  # List handling
  public_subnets = [for s in aws_subnet.public : s.id]
}
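Why `for_each` avoids the index-shift problem can be shown with a small model. The sketch below is an illustration only: it treats a resource address as a dictionary key and compares which addresses change when the first item is removed. The `svc[...]` addresses and helper names are invented for the example.

```python
def addresses_by_index(names: list[str]) -> dict[str, str]:
    """count-style addressing: the address depends on list position."""
    return {f"svc[{i}]": name for i, name in enumerate(names)}


def addresses_by_key(names: list[str]) -> dict[str, str]:
    """for_each-style addressing: the address depends on the key itself."""
    return {f'svc["{name}"]': name for name in names}


def changed(before: dict[str, str], after: dict[str, str]) -> set[str]:
    """Addresses whose content differs or that exist on only one side."""
    return {addr for addr in before.keys() | after.keys()
            if before.get(addr) != after.get(addr)}


old = ["frontend", "backend", "worker"]
new = ["backend", "worker"]  # "frontend" removed from the front of the list

print(sorted(changed(addresses_by_index(old), addresses_by_index(new))))
# ['svc[0]', 'svc[1]', 'svc[2]']  -> every address is affected
print(sorted(changed(addresses_by_key(old), addresses_by_key(new))))
# ['svc["frontend"]']  -> only the removed service is affected
```

With positional addressing, removing one element changes what every later address refers to; with keyed addressing, only the removed key is touched.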
3.2 Dynamic Blocks
Terraform
resource "aws_security_group" "web" {
  name   = "${var.environment}-web-sg"
  vpc_id = aws_vpc.main.id
  # Generate one ingress block per element of var.ingress_rules
  dynamic "ingress" {
    for_each = var.ingress_rules
    content {
      from_port   = ingress.value.port
      to_port     = ingress.value.port
      protocol    = "tcp"
      cidr_blocks = ingress.value.cidr_blocks
      description = ingress.value.description
    }
  }
  # Allow all outbound traffic
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
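A dynamic block is essentially a loop that emits one nested block per input element. The Python sketch below mirrors that expansion for the `ingress` block; the sample rules are a hypothetical value for `var.ingress_rules`, which the original does not define.

```python
# Hypothetical value for var.ingress_rules (not given in the original)
ingress_rules = [
    {"port": 80, "cidr_blocks": ["0.0.0.0/0"], "description": "HTTP"},
    {"port": 443, "cidr_blocks": ["0.0.0.0/0"], "description": "HTTPS"},
]


def expand_ingress(rules: list[dict]) -> list[dict]:
    """Produce one ingress block per rule, like dynamic "ingress" does."""
    return [
        {
            "from_port": rule["port"],
            "to_port": rule["port"],
            "protocol": "tcp",
            "cidr_blocks": rule["cidr_blocks"],
            "description": rule["description"],
        }
        for rule in rules
    ]


blocks = expand_ingress(ingress_rules)
print(len(blocks))              # 2
print(blocks[1]["from_port"])   # 443
```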
3.3 Provisioners (Use with Care)
Terraform
resource "aws_instance" "web" {
  ami           = data.aws_ami.amazon_linux.id
  instance_type = "t3.micro"
  # Option 1: provisioner. Runs only when the resource is created, and also
  # needs a connection block (SSH host, user, key), which is omitted here.
  provisioner "remote-exec" {
    inline = [
      "sudo yum update -y",
      "sudo yum install -y docker",
      "sudo systemctl start docker"
    ]
  }
  # Option 2 (best practice): user_data. Use one option or the other, not both.
  user_data = <<-EOF
    #!/bin/bash
    yum update -y
    yum install -y docker
    systemctl start docker
  EOF
}
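4. State Management
4.1 What the State File Does
Text Only
terraform.tfstate contains:
├── Resource ID mappings (name in code ↔ real ID in the cloud)
├── Resource attributes (IPs, ARNs, and so on)
├── Dependency information
└── Metadata (Terraform version, provider references, and so on)
⚠ The state file can contain sensitive values (passwords, keys).
  Never commit it to Git.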
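To make the "name in code ↔ real ID" mapping concrete, the sketch below reads a trimmed-down state document and lists the IDs behind each resource address. The sample JSON is a simplified assumption modeled on the version 4 state layout; the raw state format is an internal detail, so real tooling should prefer the output of `terraform show -json`.

```python
import json

# Simplified, hand-written sample; a real state file has many more fields
SAMPLE_STATE = """
{
  "version": 4,
  "terraform_version": "1.5.0",
  "resources": [
    {
      "mode": "managed",
      "type": "aws_instance",
      "name": "web",
      "instances": [
        {"attributes": {"id": "i-1234567890abcdef0", "private_ip": "10.0.1.10"}}
      ]
    }
  ]
}
"""


def resource_ids(state_text: str) -> dict[str, list[str]]:
    """Map each resource address (type.name) to the real IDs it tracks."""
    state = json.loads(state_text)
    mapping = {}
    for resource in state.get("resources", []):
        address = f'{resource["type"]}.{resource["name"]}'
        mapping[address] = [inst["attributes"]["id"] for inst in resource["instances"]]
    return mapping


print(resource_ids(SAMPLE_STATE))
# {'aws_instance.web': ['i-1234567890abcdef0']}
```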
4.2 Remote State Backends
Terraform
# S3 + DynamoDB backend (recommended for AWS)
terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "production/vpc/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks" # state lock (newer Terraform releases also offer S3-native locking via use_lockfile)
  }
}
# Azure Blob Storage backend
terraform {
  backend "azurerm" {
    resource_group_name  = "terraform-state-rg"
    storage_account_name = "tfstatesa"
    container_name       = "tfstate"
    key                  = "production.terraform.tfstate"
  }
}
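The purpose of the state lock is to stop two people from applying against the same state at once. The sketch below models the idea with an in-memory table and a conditional write; it is a conceptual stand-in, not the actual DynamoDB or Terraform locking protocol, and the class and user names are invented for the example.

```python
class LockHeldError(Exception):
    """Raised when another holder already owns the lock."""


class StateLock:
    """In-memory stand-in for a lock table keyed by state path."""

    def __init__(self) -> None:
        self._holders: dict[str, str] = {}

    def acquire(self, state_key: str, who: str) -> None:
        # Conditional write: succeed only if no entry exists for this key
        if state_key in self._holders:
            raise LockHeldError(f"{state_key} is locked by {self._holders[state_key]}")
        self._holders[state_key] = who

    def release(self, state_key: str, who: str) -> None:
        # Only the current holder may release the lock
        if self._holders.get(state_key) == who:
            del self._holders[state_key]


lock = StateLock()
key = "production/vpc/terraform.tfstate"

lock.acquire(key, "alice")
try:
    lock.acquire(key, "bob")  # a concurrent apply is rejected
except LockHeldError as err:
    print(err)  # production/vpc/terraform.tfstate is locked by alice

lock.release(key, "alice")
lock.acquire(key, "bob")  # succeeds once alice has released
```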
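4.3 State Commands
Bash
# List the resources in the state
terraform state list
# Show the details of one resource
terraform state show aws_instance.web
# Remove a resource from the state (the real resource is not deleted)
terraform state rm aws_instance.web
# Import an existing resource so that Terraform manages it
terraform import aws_instance.web i-1234567890abcdef0
# Move a resource (rename it or move it into a module)
terraform state mv aws_instance.web module.compute.aws_instance.web
# Release a stuck state lock
terraform force-unlock <LOCK_ID>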
4.4 Workspace Isolation
Bash
# Use workspaces to isolate the state of each environment
terraform workspace new staging
terraform workspace new production
terraform workspace select production
Terraform
# Reference the current workspace in configuration
resource "aws_instance" "web" {
  # ... other arguments omitted
  tags = {
    Environment = terraform.workspace
  }
}
5. Module Design
5.1 Module Structure
Text Only
modules/
├── vpc/
│   ├── main.tf        # resource definitions
│   ├── variables.tf   # input variables
│   ├── outputs.tf     # output values
│   └── README.md      # documentation
├── ecs-service/
│   ├── main.tf
│   ├── variables.tf
│   ├── outputs.tf
│   └── versions.tf    # provider version constraints
└── rds/
    ├── main.tf
    ├── variables.tf
    └── outputs.tf
5.2 Using Modules
Terraform
# Use a local module
module "vpc" {
  source      = "./modules/vpc"
  environment = var.environment
  cidr_block  = "10.0.0.0/16"
  az_count    = 3
}
# Use a remote module (Terraform Registry)
module "eks" {
  source          = "terraform-aws-modules/eks/aws"
  version         = "~> 19.0"
  cluster_name    = "${var.environment}-cluster"
  cluster_version = "1.28"
  vpc_id          = module.vpc.vpc_id
  subnet_ids      = module.vpc.private_subnet_ids
}
# Reference a module output (module "ecs" is assumed to be defined elsewhere)
resource "aws_ecs_service" "app" {
  cluster = module.ecs.cluster_id
  # ...
}
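5.3 Module Design Principles
- Single responsibility: each module manages one kind of resource
- Sensible abstraction: hide complexity but expose the parameters callers need
- Version constraints: declare the minimum provider version
- Good documentation: variable descriptions and README examples
- Sufficient outputs: export every ID that code outside the module may need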
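6. Pulumi: Comparison and Practice
6.1 Terraform vs. Pulumi
| Dimension | Terraform | Pulumi |
|---|---|---|
| Language | HCL (a DSL) | Python/TypeScript/Go/C# |
| State management | S3 or another remote backend | Pulumi Cloud or S3 |
| Learning curve | Requires learning HCL | Uses a language the team already knows |
| Testing | terraform test (more limited) | Standard unit-testing frameworks |
| Ecosystem | Largest selection of providers | Compatible with Terraform providers |
| Complex logic | HCL has limited expressiveness | Full programming language |
| Community | Largest | Growing quickly |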
6.2 Pulumi Example (Python)
Python
import pulumi
import pulumi_aws as aws

config = pulumi.Config()
environment = config.require("environment")

# Create the VPC
vpc = aws.ec2.Vpc(
    "main-vpc",
    cidr_block="10.0.0.0/16",
    enable_dns_hostnames=True,
    tags={
        "Name": f"{environment}-vpc",
        "Environment": environment,
    },
)

# Create the subnets with an ordinary Python loop
public_subnets = []
azs = aws.get_availability_zones(state="available")
for i, az in enumerate(azs.names[:2]):  # enumerate yields the index and the value
    subnet = aws.ec2.Subnet(
        f"public-{i}",
        vpc_id=vpc.id,
        cidr_block=f"10.0.{i+1}.0/24",
        availability_zone=az,
        map_public_ip_on_launch=True,
        tags={"Name": f"{environment}-public-{i+1}"},
    )
    public_subnets.append(subnet)

# Create the EC2 instance (conditional logic)
instance_type = "t3.large" if environment == "production" else "t3.micro"
ami = aws.ec2.get_ami(
    most_recent=True,
    owners=["amazon"],
    filters=[{"name": "name", "values": ["amzn2-ami-hvm-*-x86_64-gp2"]}],
)
instance = aws.ec2.Instance(
    "web",
    ami=ami.id,
    instance_type=instance_type,
    subnet_id=public_subnets[0].id,
    tags={"Name": f"{environment}-web"},
)

# Exports
pulumi.export("vpc_id", vpc.id)
pulumi.export("instance_ip", instance.public_ip)
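6.3 Testing Pulumi Programs
Python
import unittest
import pulumi

class MyMocks(pulumi.runtime.Mocks):
    def new_resource(self, args: pulumi.runtime.MockResourceArgs):
        # Return a fake ID and echo the inputs back as the resource state
        return [args.name + "_id", args.inputs]

    def call(self, args: pulumi.runtime.MockCallArgs):
        # Data lookups need canned results; otherwise azs.names and ami.id
        # in infra.py would be None and the program would fail on import
        if args.token == "aws:index/getAvailabilityZones:getAvailabilityZones":
            return {"names": ["us-east-1a", "us-east-1b"]}
        if args.token == "aws:ec2/getAmi:getAmi":
            return {"id": "ami-0123456789abcdef0"}
        return {}

pulumi.runtime.set_mocks(MyMocks())
# infra.py calls config.require("environment"); this assumes the default
# mock project name "project"
pulumi.runtime.set_all_config({"project:environment": "test"})

# Import the program only after the mocks are installed
from infra import vpc, instance

class TestInfra(unittest.TestCase):
    @pulumi.runtime.test
    def test_vpc_has_correct_cidr(self):
        def check(cidr):
            self.assertEqual(cidr, "10.0.0.0/16")
        return vpc.cidr_block.apply(check)

    @pulumi.runtime.test
    def test_instance_has_tags(self):
        def check(tags):
            self.assertIn("Name", tags)
        return instance.tags.apply(check)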
7. IaC Best Practices
7.1 Code Organization
Text Only
infrastructure/
├── environments/
│   ├── production/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── terraform.tfvars
│   │   └── backend.tf
│   └── staging/
│       ├── main.tf
│       ├── variables.tf
│       ├── terraform.tfvars
│       └── backend.tf
├── modules/
│   ├── networking/
│   ├── compute/
│   └── database/
└── README.md
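7.2 IaC in CI/CD
YAML
# GitHub Actions Terraform pipeline
name: Terraform
on:
  pull_request:
    paths: ['infrastructure/**']
  push:
    branches: [main]
    paths: ['infrastructure/**']
jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init
      - run: terraform fmt -check
      - run: terraform validate
      - run: terraform plan -out=plan.tfplan
      # Hand the saved plan to the apply job (jobs do not share a filesystem)
      - uses: actions/upload-artifact@v4
        with:
          name: tfplan
          path: plan.tfplan
      # On pull requests, post the plan output as a comment
      - uses: actions/github-script@v7
        if: github.event_name == 'pull_request'
        # with: script: ... (the comment-posting script is not shown)
  apply:
    needs: plan
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    environment: production
    steps:
      # A new job starts from a clean runner, so set everything up again
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init
      - uses: actions/download-artifact@v4
        with:
          name: tfplan
      - run: terraform apply -auto-approve plan.tfplan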
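For the pull-request comment step, one option is to turn the machine-readable plan into a one-line summary. The sketch below assumes the input is the JSON printed by `terraform show -json plan.tfplan`, and relies only on its `resource_changes[].change.actions` field; the sample document and the `summarize` helper are illustrative, not part of the pipeline above.

```python
import json
from collections import Counter

# Hand-written sample shaped like the output of `terraform show -json plan.tfplan`
SAMPLE_PLAN = json.dumps({
    "resource_changes": [
        {"address": "aws_vpc.main", "change": {"actions": ["no-op"]}},
        {"address": "aws_instance.web", "change": {"actions": ["delete", "create"]}},
        {"address": "aws_subnet.public[0]", "change": {"actions": ["create"]}},
    ]
})


def summarize(plan_text: str) -> str:
    """Count planned actions, treating delete+create pairs as a replace."""
    plan = json.loads(plan_text)
    counts = Counter()
    for rc in plan.get("resource_changes", []):
        actions = rc["change"]["actions"]
        if actions in (["no-op"], ["read"]):
            continue  # nothing will change for this resource
        label = "replace" if set(actions) == {"create", "delete"} else actions[0]
        counts[label] += 1
    return ", ".join(f"{n} to {label}" for label, n in sorted(counts.items())) or "no changes"


print(summarize(SAMPLE_PLAN))  # 1 to create, 1 to replace
```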
7.3 Security Checks
Bash
# Scan for security issues with tfsec
tfsec .
# Run compliance scanning with checkov
checkov -d .
# Estimate cost with infracost
infracost breakdown --path .
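To give a feel for what such scanners check, here is a deliberately tiny policy rule written by hand. It is not how tfsec or checkov are implemented; it assumes a plan in the `terraform show -json` format, where `change.after` holds the planned attributes, and flags security groups that open SSH to the whole internet. The sample plan is made up for the example.

```python
def open_ssh_findings(plan: dict) -> list[str]:
    """Return addresses of security groups that allow port 22 from 0.0.0.0/0."""
    findings = []
    for rc in plan.get("resource_changes", []):
        if rc.get("type") != "aws_security_group":
            continue
        after = rc["change"].get("after") or {}
        for rule in after.get("ingress") or []:
            world = "0.0.0.0/0" in (rule.get("cidr_blocks") or [])
            covers_ssh = rule.get("from_port", 0) <= 22 <= rule.get("to_port", 0)
            if world and covers_ssh:
                findings.append(rc["address"])
    return findings


# Hand-written sample plan fragment
sample_plan = {
    "resource_changes": [
        {
            "address": "aws_security_group.web",
            "type": "aws_security_group",
            "change": {"after": {"ingress": [
                {"from_port": 22, "to_port": 22, "cidr_blocks": ["0.0.0.0/0"]},
                {"from_port": 443, "to_port": 443, "cidr_blocks": ["0.0.0.0/0"]},
            ]}},
        },
        {
            "address": "aws_instance.web",
            "type": "aws_instance",
            "change": {"after": {"instance_type": "t3.micro"}},
        },
    ]
}

print(open_ssh_findings(sample_plan))  # ['aws_security_group.web']
```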
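8. Selected Interview Questions
Q1: What is the Terraform execution flow?
1. terraform init: initializes the provider plugins and the backend
2. terraform plan: compares the configuration with the state and produces a change plan
3. terraform apply: executes the change plan
4. terraform destroy: destroys all managed resources
Core mechanism: Terraform maintains a state file that maps the resources in code to the real resources in the cloud. Each plan compares three things: the configuration, the state, and the actual infrastructure.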
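The three-way comparison can be modeled in a few lines. The sketch below is a simplification under stated assumptions: resources are plain dictionaries, "refresh" just copies the real attributes of tracked resources into the state, and the diff yields create, update, or delete. The real planner is far more detailed; the names and sample data are invented.

```python
def plan(config: dict, state: dict, actual: dict) -> dict[str, str]:
    """Return the action needed for each resource name."""
    # Step 1 (refresh): update the tracked resources from what really exists
    refreshed = {name: actual[name] for name in state if name in actual}

    # Step 2 (diff): compare the configuration against the refreshed state
    actions = {}
    for name, desired in config.items():
        if name not in refreshed:
            actions[name] = "create"
        elif refreshed[name] != desired:
            actions[name] = "update"
    for name in refreshed:
        if name not in config:
            actions[name] = "delete"
    return actions


config = {"web": {"type": "t3.micro"}, "db": {"type": "t3.small"}}
state = {"web": {"type": "t3.micro"}, "cache": {"type": "t3.micro"}}
# "web" was resized by hand in the console, so reality has drifted
actual = {"web": {"type": "t3.large"}, "cache": {"type": "t3.micro"}}

print(plan(config, state, actual))
# {'web': 'update', 'db': 'create', 'cache': 'delete'}
```

Comparing the configuration with the state alone would miss the manual resize of `web`; the refresh step is what surfaces that drift.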
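Q2: Why is the Terraform state file important?
- Resource mapping: maps resource names in code to real resource IDs in the cloud
- Performance: caches resource attributes to reduce API calls
- Dependency tracking: records the dependencies between resources
- Sensitive data: may contain passwords and other secrets, so it must be stored encrypted
- Collaboration: remote state plus locking lets a team work on the same infrastructure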
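Q3: How do you manage multiple environments?
| Approach | Pros | Cons |
|---|---|---|
| Workspaces | Simple | Hard to handle per-environment differences with shared modules |
| Directory isolation | Flexible, independent state | Module calls may be duplicated |
| Terragrunt | DRY, suits complex projects | Extra tool dependency |
Recommendation: directory isolation combined with shared modules.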
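Q4: When do you use terraform import?
To bring existing cloud resources under Terraform management:
1. Write the resource definition in a .tf file
2. Run terraform import aws_instance.web i-12345
3. Run terraform plan to verify that the configuration matches the real resource
4. From then on, the resource is fully managed by Terraform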
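Q5: How do you choose between Terraform and Pulumi?
- Choose Terraform when the team has an operations background, needs the largest ecosystem, and finds HCL sufficient
- Choose Pulumi when the team has a development background, needs complex logic (loops/conditionals), wants to use a familiar language, and values testing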
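9. Recommended Resources
- Terraform official documentation
- Terraform Registry: the library of modules and providers
- Pulumi official documentation
- Terraform Best Practices
- "Terraform: Up & Running" by Yevgeniy Brikman
Next step: combine this chapter with 16-GitOps Practices to integrate IaC into a GitOps workflow and fully automate the management of infrastructure changes.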