Kubernetes Cost Optimization: How to Cut Resource Waste by 30%


Introduction: The Truth About Runaway Cloud Costs

Many teams deploy applications on Kubernetes only to watch their cloud bills climb. According to Gartner research, the average K8s cluster wastes 30-40% of its resources.

This tutorial takes a deep dive into Kubernetes resource optimization strategies so you can cut resource waste by 30% while keeping your applications running stably.

Chapter 1: The Root Causes of Resource Waste

1.1 Common Waste Scenarios

❌ Common causes of resource waste:

  1. Over-provisioning
    • Default request values set too high
    • Unreasonable CPU limits
    • Excessive memory reservations
  2. Resource fragmentation
    • Uneven resource utilization across nodes
    • Small workloads parked on large nodes
    • High node idle rates
  3. Missing autoscaling
    • Fixed capacity cannot absorb traffic swings
    • Resource shortages at peak times
    • Idle resources during troughs
  4. Image and configuration problems
    • Oversized images
    • Unoptimized container startup
    • Too many unnecessary applications

1.2 Cost Comparison Data

┌─────────────────────────────────┬──────────────┬──────────────┬────────────┐
│ Optimization area               │ Before       │ After        │ Savings    │
├─────────────────────────────────┼──────────────┼──────────────┼────────────┤
│ CPU requests                    │ 100%         │ 45%          │ 55%        │
│ Memory requests                 │ 100%         │ 50%          │ 50%        │
│ Node count                      │ 10           │ 6            │ 40%        │
│ Monthly cost (10-node cluster)  │ ¥50,000      │ ¥32,000      │ 36%        │
│ Resource utilization            │ 35%          │ 65%          │ 86%↑       │
└─────────────────────────────────┴──────────────┴──────────────┴────────────┘
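As a sanity check, the percentages in the table can be reproduced from its raw numbers (all figures taken from the table above):

```python
# Reproduce the savings figures from the table above
monthly_before, monthly_after = 50_000, 32_000
cost_saving = 1 - monthly_after / monthly_before
print(f"monthly cost saving: {cost_saving:.0%}")      # 36%

# 35% -> 65% absolute utilization is an ~86% relative improvement
util_before, util_after = 0.35, 0.65
util_gain = (util_after - util_before) / util_before
print(f"relative utilization gain: {util_gain:.0%}")  # 86%
```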

Chapter 2: Resource Requests and Limits

2.1 Understanding Requests and Limits

```yaml
# Resource configuration explained
resources:
  requests:
    cpu: "500m"       # Guaranteed resources (the scheduler's basis)
    memory: "256Mi"   # Guaranteed resources (the scheduler's basis)
  limits:
    cpu: "1000m"      # Maximum usage (may be throttled)
    memory: "512Mi"   # Maximum usage (OOM-kill risk if exceeded)
```

Key concepts:
  • Requests: the minimum resources guaranteed at scheduling time
  • Limits: the maximum resources a container may use
  • CPU: measured in millicores (1000m = 1 CPU)
  • Memory: measured in binary units (1GiB = 1024MiB)
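To reason about these units programmatically, the quantity strings can be converted to plain numbers. A minimal sketch, handling only the suffixes used in this article (the full Kubernetes quantity grammar supports more):

```python
def parse_cpu(q: str) -> float:
    """Convert a CPU quantity ("500m" or "2") to cores."""
    return float(q[:-1]) / 1000 if q.endswith("m") else float(q)

def parse_memory_mib(q: str) -> float:
    """Convert a memory quantity ("256Mi", "1Gi", "512Ki") to MiB."""
    units = {"Ki": 1 / 1024, "Mi": 1, "Gi": 1024}
    for suffix, factor in units.items():
        if q.endswith(suffix):
            return float(q[: -len(suffix)]) * factor
    raise ValueError(f"unsupported quantity: {q}")

print(parse_cpu("500m"))        # 0.5
print(parse_memory_mib("1Gi"))  # 1024.0
```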

2.2 Sensible Resource Configuration

```yaml
# ❌ A badly over-provisioned configuration
apiVersion: v1
kind: Pod
metadata:
  name: over-provisioned
spec:
  containers:
  - name: app
    image: myapp:latest
    resources:
      requests:
        cpu: "4000m"
        memory: "8Gi"
      limits:
        cpu: "8000m"
        memory: "16Gi"
    # Actual usage: 200m CPU, 256Mi memory
```

```yaml
# ✅ An optimized configuration
apiVersion: v1
kind: Pod
metadata:
  name: optimized
spec:
  containers:
  - name: app
    image: myapp:latest
    resources:
      requests:
        cpu: "200m"
        memory: "256Mi"
      limits:
        cpu: "500m"
        memory: "512Mi"
```


2.3 Configuration Based on Real Usage Data

```yaml
# Use VPA to obtain recommended values
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Auto"   # Off, Initial, Recreate, Auto
```

```yaml
# Recommended configuration derived from monitoring
apiVersion: apps/v1
kind: Deployment
metadata:
  name: optimized-app
spec:
  template:
    spec:
      containers:
      - name: app
        image: myapp:latest
        resources:
          requests:
            cpu: "250m"
            memory: "384Mi"
          limits:
            cpu: "600m"
            memory: "768Mi"
        # Sized from 7 days of monitoring data:
        # CPU average: 180m, peak: 450m
        # Memory average: 320Mi, peak: 600Mi
```
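The comments in the manifest above describe the rule of thumb being applied: requests sized near the high end of typical usage, limits above the observed peak. This is not the actual VPA estimator, just an illustrative heuristic with hypothetical sample data:

```python
import math
import statistics

def recommend_requests_limits(samples, limit_headroom=1.3):
    """Illustrative sizing heuristic (not the real VPA algorithm):
    request ~ 90th percentile of observed usage, limit ~ peak plus headroom."""
    p90 = statistics.quantiles(sorted(samples), n=10)[-1]
    request = math.ceil(p90)
    limit = math.ceil(max(samples) * limit_headroom)
    return request, limit

# Hypothetical week of CPU samples in millicores, peaking at 450m
samples = [150, 160, 170, 180, 180, 190, 200, 210, 250, 450]
req, lim = recommend_requests_limits(samples)
print(req, lim)
```

In practice you would feed this from Prometheus range queries rather than a hand-written list, and cross-check against the VPA recommendation.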


2.4 Recommended Configurations by Scenario

```yaml
# Frontend service (high concurrency)
frontend:
  requests:
    cpu: "100m"
    memory: "128Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"

# Backend service (compute-intensive)
backend:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:
    cpu: "2000m"
    memory: "2Gi"

# Database (steady-state workload)
database:
  requests:
    cpu: "1000m"
    memory: "4Gi"
  limits:
    cpu: "4000m"
    memory: "8Gi"

# Batch jobs (low priority)
batch:
  requests:
    cpu: "1000m"
    memory: "2Gi"
  limits:
    cpu: "4000m"
    memory: "8Gi"
  priorityClassName: low-priority
```


Chapter 3: Autoscaling Strategies

3.1 Horizontal Pod Autoscaler (HPA)

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Pods
        value: 4
        periodSeconds: 10
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Pods
        value: 2
        periodSeconds: 60
```
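Under the hood, the HPA computes its target with the documented ratio desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), clamped to the min/max bounds; the behavior policies above then cap how fast each step may move. A direct sketch of the core formula:

```python
import math

def desired_replicas(current, metric_value, target_value,
                     min_replicas=2, max_replicas=20):
    """HPA core formula: ceil(current * metric / target), clamped to bounds."""
    desired = math.ceil(current * metric_value / target_value)
    return max(min_replicas, min(desired, max_replicas))

# CPU utilization at 90% against a 70% target: scale 4 pods up to 6
print(desired_replicas(4, 90, 70))  # 6
# Utilization at 30%: scale down toward the floor of 2 (the 300s
# stabilization window above delays this to avoid flapping)
print(desired_replicas(4, 30, 70))  # 2
```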


3.2 Vertical Pod Autoscaler (VPA)

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Initial"   # Off, Initial, Recreate, Auto
  resourcePolicy:
    containerPolicies:
    - containerName: "*"
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: 2000m
        memory: 2Gi
```
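The resourcePolicy section above bounds whatever the recommender proposes; conceptually it is a per-resource clamp. A small sketch with hypothetical values in millicores and MiB:

```python
def clamp_recommendation(recommended, min_allowed, max_allowed):
    """Apply a VPA containerPolicy-style bound to a recommendation."""
    return {k: max(min_allowed[k], min(recommended[k], max_allowed[k]))
            for k in recommended}

min_allowed = {"cpu_m": 100, "memory_mib": 128}
max_allowed = {"cpu_m": 2000, "memory_mib": 2048}
# Recommender asks for more CPU than the policy permits: clamped to 2000m
print(clamp_recommendation({"cpu_m": 2600, "memory_mib": 512},
                           min_allowed, max_allowed))
```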


3.3 Cluster Autoscaler

```yaml
# Cluster Autoscaler settings (in practice these are flags on the
# autoscaler deployment; shown as a ConfigMap here for illustration)
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler
  namespace: kube-system
data:
  CLUSTER_AUTOSCALER_ENABLED: "true"
  SKIP_NODES_WITH_LOCAL_STORAGE: "true"
  SKIP_NODES_WITH_SYSTEM_DAEMONS: "true"
```

```yaml
# Node pool configuration example (the exact schema varies by cloud
# provider; this is illustrative, not a standard Kubernetes API object)
apiVersion: autoscaling.k8s.io/v1
kind: ClusterAutoscaler
metadata:
  name: cluster-autoscaler
spec:
  minNodes: 3
  maxNodes: 20
  nodeGroups:
  - nodeGroup: "default-pool"
    minSize: 2
    maxSize: 10
    instanceTypes:
    - "n1-standard-2"
    - "n1-standard-4"
```
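To reason about how many nodes the autoscaler will end up requesting, a back-of-the-envelope estimate divides aggregate pod requests by per-node allocatable capacity. This ignores bin-packing fragmentation and daemonset overhead, so treat it as a lower bound (capacities below approximate an n1-standard-2):

```python
import math

def min_nodes_needed(pod_requests, node_cpu_m, node_mem_mib):
    """Lower-bound node count from aggregate requests (ignores bin-packing)."""
    total_cpu = sum(cpu for cpu, _ in pod_requests)
    total_mem = sum(mem for _, mem in pod_requests)
    return max(math.ceil(total_cpu / node_cpu_m),
               math.ceil(total_mem / node_mem_mib))

# 10 pods requesting 500m CPU / 1Gi each on 2-core, 7.5GB nodes
pods = [(500, 1024)] * 10
print(min_nodes_needed(pods, node_cpu_m=2000, node_mem_mib=7680))  # 3
```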

3.4 Scaling on Custom Metrics

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: custom-metrics-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-service
  minReplicas: 2
  maxReplicas: 50
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "100"
  - type: Object
    object:
      metric:
        name: queue_depth
      describedObject:
        apiVersion: apps/v1
        kind: Deployment
        name: message-queue
      target:
        type: Value
        value: "50"
```


Chapter 4: Resource Optimization Strategies

4.1 Node-Level Optimization

```yaml
# Use node affinity to steer scheduling
apiVersion: apps/v1
kind: Deployment
metadata:
  name: optimized-deployment
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: node-type
                operator: In
                values: ["compute"]
              - key: zone
                operator: In
                values: ["us-east-1a"]
      tolerations:
      - key: "dedicated"
        operator: "Equal"
        value: "special-workload"
        effect: "NoSchedule"
```

```yaml
# Use pod priority to steer resource allocation
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "Critical workloads"
---
apiVersion: v1
kind: Pod
metadata:
  name: critical-service
spec:
  priorityClassName: high-priority
  containers:
  - name: app
    image: myapp:latest
```


4.2 Image Optimization

```dockerfile
# ❌ Oversized image example
FROM ubuntu:latest
RUN apt-get update && apt-get install -y \
    build-essential \
    curl \
    wget \
    vim \
    … # 2GB+ image
```

```dockerfile
# ✅ Multi-stage build optimization
# Stage 1: build
FROM golang:1.21 AS builder
WORKDIR /app
COPY . .
RUN go build -o main .

# Stage 2: runtime
FROM alpine:3.18
WORKDIR /app
COPY --from=builder /app/main .
RUN adduser -D appuser && chown -R appuser:appuser /app
USER appuser
EXPOSE 8080
CMD ["./main"]
```

Result: a ~50MB image


4.3 Scheduling Optimization

```yaml
# Protect availability with a Pod Disruption Budget
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: my-app
```

```yaml
# Spread pods across zones with topology spread constraints
apiVersion: apps/v1
kind: Deployment
metadata:
  name: topology-spread
spec:
  template:
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: my-app
```


4.4 Resource Monitoring and Analysis

```yaml
# Alerting with Prometheus
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: resource-optimization
spec:
  groups:
  - name: resource-optimization
    rules:
    - alert: HighCPUUsage
      expr: sum(rate(container_cpu_usage_seconds_total[5m])) by (pod) > 0.8
      for: 5m
      annotations:
        summary: "Pod {{ $labels.pod }} has high CPU usage"
```


Chapter 5: Cost Optimization in Practice

5.1 Using Spot Instances

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: spot-instance
spec:
  nodeSelector:
    kubernetes.io/os: linux
    node-type: spot
  tolerations:
  - key: spot-instance
    operator: Equal
    value: "true"
    effect: NoSchedule
  containers:
  - name: app
    image: myapp:latest
    resources:
      requests:
        cpu: "1000m"
        memory: "2Gi"
      limits:
        cpu: "2000m"
        memory: "4Gi"
```


5.2 Using Reserved Instances

```yaml
# Pin long-running services to reserved nodes
apiVersion: v1
kind: Pod
metadata:
  name: reserved-workload
spec:
  nodeSelector:
    reserved: "true"
    workloads: "stable"
  containers:
  - name: app
    image: myapp:latest
    resources:
      requests:
        cpu: "2000m"
        memory: "4Gi"
      limits:
        cpu: "4000m"
        memory: "8Gi"
```


5.3 Cost Comparison Data

Before optimization:

  • Nodes: 20 × n1-standard-4
  • CPU requests: 40 cores
  • Memory requests: 80 GB
  • Monthly cost: ¥80,000
  • Resource utilization: 35%

After optimization:

  • Nodes: 12, mixed configuration
    • 6 × n1-standard-4 (core services)
    • 4 × Spot instances (batch jobs)
    • 2 × Reserved instances (stable services)
  • CPU requests: 20 cores
  • Memory requests: 40 GB
  • Monthly cost: ¥52,000
  • Resource utilization: 68%

Savings: ¥28,000/month (35%)
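The claimed savings follow directly from the before/after figures above:

```python
# Verify the savings claim from the before/after numbers
before, after = 80_000, 52_000
saved = before - after
print(f"¥{saved:,}/month ({saved / before:.0%})")  # ¥28,000/month (35%)
```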


Chapter 6: Best-Practice Checklist

6.1 Resource Configuration Best Practices

✅ Best practices:

  1. Set sensible Requests and Limits
  2. Avoid setting Requests equal to Limits everywhere
  3. Review resource configuration regularly
  4. Use VPA to obtain recommended values
  5. Monitor actual usage
  6. Use resource quotas (ResourceQuota)

❌ Avoid:

  1. Over-provisioning resources
  2. Leaving Limits unset
  3. Ignoring memory OOM kills
  4. Not monitoring resource usage
  5. Ignoring cost metrics

6.2 Monitoring and Alerting

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: namespace-quota
  namespace: production
spec:
  hard:
    requests.cpu: "20"
    requests.memory: "40Gi"
    limits.cpu: "40"
    limits.memory: "80Gi"
    pods: "100"
```
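Before admitting a new workload, it helps to check whether its requests fit the namespace's remaining quota headroom. A small sketch using the hard limits above (the usage numbers are hypothetical; `kubectl describe resourcequota` reports the real used/hard pairs):

```python
def fits_quota(hard, used, request):
    """True if `request` fits within the remaining ResourceQuota headroom."""
    return all(used.get(k, 0) + request.get(k, 0) <= hard[k] for k in hard)

hard = {"cpu": 20.0, "memory_gib": 40.0, "pods": 100}
used = {"cpu": 18.5, "memory_gib": 30.0, "pods": 80}  # hypothetical usage
print(fits_quota(hard, used, {"cpu": 1.0, "memory_gib": 2.0, "pods": 1}))  # True
print(fits_quota(hard, used, {"cpu": 2.0, "memory_gib": 2.0, "pods": 1}))  # False
```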

      
      

6.3 Continuous Optimization Strategy

Suggested optimization cadence:

  • Daily: check resource usage
  • Weekly: tune VPA configuration
  • Monthly: analyze cost reports
  • Quarterly: optimize the overall architecture

Summary: A K8s Resource Optimization Guide

With the right optimization strategy in place, the core levers are:

  1. Configure Requests and Limits precisely
  2. Use HPA/VPA for autoscaling
  3. Optimize image size
  4. Use nodes and instance types wisely
  5. Monitor and optimize continuously

Expected gains:

  • Cost reduction: 30-40%
  • Resource utilization: roughly doubled
  • Operational efficiency: up 50%
  • Improved system stability

Next steps:

  1. Assess current resource usage
  2. Configure sensible Requests/Limits
  3. Enable autoscaling
  4. Set up monitoring and alerting
  5. Iterate and keep optimizing

Master Kubernetes resource optimization and watch your cloud costs shrink while your efficiency doubles! 🚀

References:

  • [Kubernetes resource management](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/)
  • [Kubernetes autoscaling](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/)
  • [Cloud cost optimization guide](https://aws.amazon.com/blogs/containers/kubernetes-resource-management-and-cost-optimization/)
  • [Spot Instances documentation](https://aws.amazon.com/ec2/spot/)
