📋 目录
引言:为什么选择容器化PostgreSQL
在数字化转型的浪潮中,数据库作为企业的"心脏",其稳定性和扩展性直接影响着业务的成败。PostgreSQL作为世界上最先进的开源关系型数据库,配合容器化技术,就像是给数据库插上了翅膀——既保持了数据的可靠性,又获得了云原生的灵活性。
为什么PostgreSQL + 容器化是绝配?
- 资源隔离:容器提供完美的资源边界
- 快速部署:从传统的小时级部署缩短到分钟级
- 环境一致性:开发、测试、生产环境完全一致
- 弹性扩展:根据业务负载自动调整资源
让我们一起探索这个令人兴奋的技术组合!
PostgreSQL容器化基础
容器化架构总览
核心组件说明
PostgreSQL主从集群
- Master节点:处理所有写操作和部分读操作
- Slave节点:处理只读查询,提供数据备份
中间件层
- PgPool-II:连接池管理、负载均衡、故障转移
- HAProxy/Nginx:七层负载均衡
存储层
- 持久化卷:使用Kubernetes PV/PVC确保数据持久性
- 存储类:支持动态卷配置
容器镜像准备
自定义PostgreSQL镜像
FROM postgres:14
LABEL maintainer="your-team@company.com"
# 安装必要的扩展
RUN apt-get update && apt-get install -y \
postgresql-14-repmgr \
postgresql-14-pgpool2 \
postgresql-contrib-14 \
&& rm -rf /var/lib/apt/lists/*
# 复制配置文件
COPY postgresql.conf /etc/postgresql/
COPY pg_hba.conf /etc/postgresql/
COPY docker-entrypoint-initdb.d/ /docker-entrypoint-initdb.d/
# 设置权限
RUN chmod +x /docker-entrypoint-initdb.d/*.sh
EXPOSE 5432
构建镜像命令
# 构建镜像
docker build -t your-registry/postgresql:14-cluster .
# 推送到镜像仓库
docker push your-registry/postgresql:14-cluster
分布式架构设计
整体架构图
架构优势
🚀 高性能
- 连接池复用,减少连接开销
- 读写分离,充分利用硬件资源
- 水平分片,突破单机性能瓶颈
🛡️ 高可用
- 多副本冗余,避免单点故障
- 自动故障转移,业务无感知切换
- 跨可用区部署,容灾能力强
📈 高扩展
- 动态添加从节点,应对读压力
- 水平分片扩展,应对数据增长
- 资源弹性伸缩,成本可控
高可用实现方案
故障检测与切换流程
核心配置要点
健康检查配置
# PgPool-II 健康检查
health_check_period = 10 # 10秒检查一次
health_check_timeout = 5 # 5秒超时
health_check_max_retries = 3 # 最多重试3次
故障转移策略
- 自动切换时间:通常设置为30-60秒
- 数据一致性保证:使用同步复制模式
- 切换通知机制:集成企业通知系统
详细部署步骤
第一步:创建命名空间和存储类
# 创建专用命名空间
kubectl create namespace postgresql-cluster
# 创建存储类
cat <<EOF | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: postgresql-ssd
provisioner: kubernetes.io/aws-ebs
parameters:
type: gp3
fsType: ext4
encrypted: "true"
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
EOF
第二步:创建ConfigMap配置
apiVersion: v1
kind: ConfigMap
metadata:
name: postgresql-config
namespace: postgresql-cluster
data:
postgresql.conf: |
# 连接配置
max_connections = 200
shared_buffers = 256MB
effective_cache_size = 1GB
# WAL配置
wal_level = replica
max_wal_senders = 5
max_replication_slots = 5
# 日志配置
log_destination = 'stderr'
log_statement = 'mod'
log_duration = on
log_min_duration_statement = 1000
# 性能调优
checkpoint_completion_target = 0.9
wal_buffers = 16MB
maintenance_work_mem = 64MB
pg_hba.conf: |
# 本地连接
local all all trust
# IPv4本地连接
host all all 127.0.0.1/32 md5
# IPv6本地连接
host all all ::1/128 md5
# 集群内部连接
host all all 10.0.0.0/8 md5
host replication replicator 10.0.0.0/8 md5
init-master.sh: |
#!/bin/bash
set -e
# 创建复制用户
psql -v ON_ERROR_STOP=1 --username "$POSTGRES_USER" --dbname "$POSTGRES_DB" <<-EOSQL
CREATE USER replicator WITH REPLICATION ENCRYPTED PASSWORD '$POSTGRES_REPLICATION_PASSWORD';
GRANT CONNECT ON DATABASE $POSTGRES_DB TO replicator;
EOSQL
init-slave.sh: |
#!/bin/bash
set -e
# 等待主节点就绪
until pg_isready -h $PGMASTER_SERVICE -p 5432; do
echo "等待主节点就绪..."
sleep 2
done
# 基础备份
pg_basebackup -h $PGMASTER_SERVICE -D /var/lib/postgresql/data -U replicator -v -P -W
# 创建恢复配置
cat > /var/lib/postgresql/data/recovery.conf <<EOF
standby_mode = 'on'
primary_conninfo = 'host=$PGMASTER_SERVICE port=5432 user=replicator'
trigger_file = '/tmp/postgresql.trigger'
EOF
第三步:部署主节点
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: postgresql-master
namespace: postgresql-cluster
spec:
serviceName: postgresql-master
replicas: 1
selector:
matchLabels:
app: postgresql
role: master
template:
metadata:
labels:
app: postgresql
role: master
spec:
containers:
- name: postgresql
image: your-registry/postgresql:14-cluster
ports:
- containerPort: 5432
env:
- name: POSTGRES_DB
value: "production"
- name: POSTGRES_USER
value: "admin"
- name: POSTGRES_PASSWORD
valueFrom:
secretKeyRef:
name: postgresql-secret
key: postgres-password
- name: POSTGRES_REPLICATION_PASSWORD
valueFrom:
secretKeyRef:
name: postgresql-secret
key: replication-password
- name: PGDATA
value: /var/lib/postgresql/data/pgdata
volumeMounts:
- name: data
mountPath: /var/lib/postgresql/data
- name: config
mountPath: /etc/postgresql
- name: init-scripts
mountPath: /docker-entrypoint-initdb.d
resources:
requests:
cpu: 2000m
memory: 4Gi
limits:
cpu: 4000m
memory: 8Gi
livenessProbe:
exec:
command:
- pg_isready
- -U
- admin
- -d
- production
initialDelaySeconds: 30
periodSeconds: 10
readinessProbe:
exec:
command:
- pg_isready
- -U
- admin
- -d
- production
initialDelaySeconds: 5
periodSeconds: 5
volumes:
- name: config
configMap:
name: postgresql-config
items:
- key: postgresql.conf
path: postgresql.conf
- key: pg_hba.conf
path: pg_hba.conf
- name: init-scripts
configMap:
name: postgresql-config
items:
- key: init-master.sh
path: init-master.sh
mode: 0755
volumeClaimTemplates:
- metadata:
name: data
spec:
accessModes: ["ReadWriteOnce"]
storageClassName: postgresql-ssd
resources:
requests:
storage: 100Gi
第四步:创建Secret
# 创建密钥
kubectl create secret generic postgresql-secret \
--from-literal=postgres-password=your-super-secret-password \
--from-literal=replication-password=your-replication-password \
-n postgresql-cluster
第五步:创建服务
apiVersion: v1
kind: Service
metadata:
name: postgresql-master
namespace: postgresql-cluster
spec:
selector:
app: postgresql
role: master
ports:
- port: 5432
targetPort: 5432
type: ClusterIP
---
apiVersion: v1
kind: Service
metadata:
name: postgresql-slaves
namespace: postgresql-cluster
spec:
selector:
app: postgresql
role: slave
ports:
- port: 5432
targetPort: 5432
type: ClusterIP
读写分离架构
读写分离流程图
实现细节
智能路由规则
- 写操作:INSERT、UPDATE、DELETE → 主节点
- 读操作:SELECT → 从节点(负载均衡)
- 事务一致性:事务内所有操作路由到主节点
延迟控制
- 同步复制:关键业务,零延迟
- 异步复制:一般业务,秒级延迟
- 延迟监控:超过阈值自动摘除节点
PgPool-II详细配置
第一步:部署从节点集群
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: postgresql-slaves
namespace: postgresql-cluster
spec:
serviceName: postgresql-slaves
replicas: 2
selector:
matchLabels:
app: postgresql
role: slave
template:
metadata:
labels:
app: postgresql
role: slave
spec:
containers:
- name: postgresql
image: your-registry/postgresql:14-cluster
ports:
- containerPort: 5432
env:
- name: PGMASTER_SERVICE
value: "postgresql-master"
- name: POSTGRES_DB
value: "production"
- name: POSTGRES_USER
value: "admin"
- name: POSTGRES_PASSWORD
valueFrom:
secretKeyRef:
name: postgresql-secret
key: postgres-password
- name: POSTGRES_REPLICATION_PASSWORD
valueFrom:
secretKeyRef:
name: postgresql-secret
key: replication-password
- name: PGDATA
value: /var/lib/postgresql/data/pgdata
volumeMounts:
- name: data
mountPath: /var/lib/postgresql/data
- name: init-scripts
mountPath: /docker-entrypoint-initdb.d
command:
- /docker-entrypoint-initdb.d/init-slave.sh
resources:
requests:
cpu: 1000m
memory: 2Gi
limits:
cpu: 2000m
memory: 4Gi
volumes:
- name: init-scripts
configMap:
name: postgresql-config
items:
- key: init-slave.sh
path: init-slave.sh
mode: 0755
volumeClaimTemplates:
- metadata:
name: data
spec:
accessModes: ["ReadWriteOnce"]
storageClassName: postgresql-ssd
resources:
requests:
storage: 100Gi
第二步:PgPool-II配置
apiVersion: v1
kind: ConfigMap
metadata:
name: pgpool-config
namespace: postgresql-cluster
data:
pgpool.conf: |
# 连接池配置
listen_addresses = '*'
port = 5432
num_init_children = 20
max_pool = 4
# 负载均衡配置
load_balance_mode = on
master_slave_mode = on
master_slave_sub_mode = 'stream'
# 后端数据库配置
backend_hostname0 = 'postgresql-master'
backend_port0 = 5432
backend_weight0 = 1
backend_data_directory0 = '/var/lib/postgresql/data'
backend_flag0 = 'ALLOW_TO_FAILOVER'
backend_hostname1 = 'postgresql-slaves-0.postgresql-slaves'
backend_port1 = 5432
backend_weight1 = 1
backend_data_directory1 = '/var/lib/postgresql/data'
backend_flag1 = 'ALLOW_TO_FAILOVER'
backend_hostname2 = 'postgresql-slaves-1.postgresql-slaves'
backend_port2 = 5432
backend_weight2 = 1
backend_data_directory2 = '/var/lib/postgresql/data'
backend_flag2 = 'ALLOW_TO_FAILOVER'
# 健康检查配置
health_check_period = 10
health_check_timeout = 5
health_check_user = 'admin'
health_check_password = ''
health_check_database = 'production'
health_check_max_retries = 3
health_check_retry_delay = 1
# 故障转移配置
failover_command = '/etc/pgpool/failover.sh %d %h %p %D %m %H %M %P %r %R'
follow_master_command = '/etc/pgpool/follow_master.sh %d %h %p %D %m %H %M %P %r %R'
# 在线恢复配置
recovery_user = 'postgres'
recovery_password = ''
recovery_1st_stage_command = 'recovery_1st_stage'
pcp.conf: |
# PCP配置文件
admin:e10adc3949ba59abbe56e057f20f883e
pool_hba.conf: |
# HBA配置
local all all trust
host all all 0.0.0.0/0 md5
host all all ::/0 md5
failover.sh: |
#!/bin/bash
# 故障转移脚本
NODE_ID=$1
HOSTNAME=$2
PORT=$3
DATABASE_DIR=$4
NEW_MASTER=$5
echo "$(date): Failover triggered for node $NODE_ID ($HOSTNAME:$PORT)" >> /var/log/pgpool/failover.log
# 这里可以添加通知逻辑,如发送邮件、Slack消息等
curl -X POST "https://hooks.slack.com/your-webhook" \
-H 'Content-type: application/json' \
--data "{\"text\":\"PostgreSQL节点 $HOSTNAME:$PORT 发生故障,已触发故障转移\"}"
第三步:部署PgPool-II
apiVersion: apps/v1
kind: Deployment
metadata:
name: pgpool
namespace: postgresql-cluster
spec:
replicas: 2
selector:
matchLabels:
app: pgpool
template:
metadata:
labels:
app: pgpool
spec:
containers:
- name: pgpool
image: pgpool/pgpool:4.3
ports:
- containerPort: 5432
- containerPort: 9999
env:
- name: POSTGRES_PASSWORD
valueFrom:
secretKeyRef:
name: postgresql-secret
key: postgres-password
volumeMounts:
- name: pgpool-config
mountPath: /etc/pgpool
resources:
requests:
cpu: 500m
memory: 512Mi
limits:
cpu: 1000m
memory: 1Gi
livenessProbe:
tcpSocket:
port: 5432
initialDelaySeconds: 30
periodSeconds: 10
readinessProbe:
tcpSocket:
port: 5432
initialDelaySeconds: 5
periodSeconds: 5
volumes:
- name: pgpool-config
configMap:
name: pgpool-config
defaultMode: 0755
---
apiVersion: v1
kind: Service
metadata:
name: pgpool
namespace: postgresql-cluster
spec:
selector:
app: pgpool
ports:
- name: postgresql
port: 5432
targetPort: 5432
- name: pcp
port: 9999
targetPort: 9999
type: LoadBalancer
第四步:验证读写分离
# 连接到PgPool进行测试
kubectl exec -it deployment/pgpool -n postgresql-cluster -- psql -h localhost -U admin -d production
# 测试写操作(应该路由到主节点)
production=# INSERT INTO test_table (name) VALUES ('test');
# 测试读操作(应该路由到从节点)
production=# SELECT * FROM test_table;
# 查看连接分布
production=# SHOW POOL_NODES;
动态扩缩容策略
自动扩缩容架构
扩缩容触发条件
扩容场景
- CPU使用率 > 70%,持续5分钟
- 内存使用率 > 80%,持续3分钟
- 连接数 > 最大连接数的85%
- 查询响应时间 > 500ms,持续2分钟
缩容场景
- CPU使用率 < 30%,持续15分钟
- 内存使用率 < 40%,持续10分钟
- 连接数 < 最大连接数的20%
- 业务低峰期:凌晨2-6点
扩缩容安全机制
渐进式扩容
# 扩容策略
scaleUp:
stabilizationWindowSeconds: 60 # 稳定窗口60秒
policies:
- type: Percent
value: 50 # 每次最多扩容50%
periodSeconds: 60
保护性缩容
# 缩容策略
scaleDown:
stabilizationWindowSeconds: 300 # 稳定窗口5分钟
policies:
- type: Pods
value: 1 # 每次最多缩容1个Pod
periodSeconds: 180
详细实施步骤
第一步:部署Metrics Server
# 安装Metrics Server
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
# 验证安装
kubectl get apiservice v1beta1.metrics.k8s.io -o yaml
第二步:配置HPA
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: postgresql-slaves-hpa
namespace: postgresql-cluster
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: StatefulSet
name: postgresql-slaves
minReplicas: 2
maxReplicas: 10
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
behavior:
scaleUp:
stabilizationWindowSeconds: 60
policies:
- type: Percent
value: 50
periodSeconds: 60
- type: Pods
value: 2
periodSeconds: 60
selectPolicy: Min
scaleDown:
stabilizationWindowSeconds: 300
policies:
- type: Pods
value: 1
periodSeconds: 180
第三步:自定义指标配置
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-postgresql-exporter
namespace: postgresql-cluster
data:
config.yaml: |
datasource:
host: postgresql-master
user: postgres_exporter
password: exporter_password
database: production
sslmode: disable
queries:
- name: pg_connections
query: "SELECT count(*) as connections FROM pg_stat_activity WHERE state = 'active'"
master: true
metrics:
- connections:
usage: "GAUGE"
description: "Number of active connections"
- name: pg_replication_lag
query: "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS lag"
master: false
metrics:
- lag:
usage: "GAUGE"
description: "Replication lag in seconds"
- name: pg_slow_queries
query: "SELECT count(*) as slow_queries FROM pg_stat_statements WHERE mean_time > 1000"
master: true
metrics:
- slow_queries:
usage: "GAUGE"
description: "Number of slow queries"
第四步:部署自定义监控
apiVersion: apps/v1
kind: Deployment
metadata:
name: postgresql-exporter
namespace: postgresql-cluster
spec:
replicas: 1
selector:
matchLabels:
app: postgresql-exporter
template:
metadata:
labels:
app: postgresql-exporter
spec:
containers:
- name: postgres-exporter
image: prometheuscommunity/postgres-exporter:v0.11.1
ports:
- containerPort: 9187
env:
- name: DATA_SOURCE_NAME
value: "postgresql://postgres_exporter:exporter_password@postgresql-master:5432/production?sslmode=disable"
- name: PG_EXPORTER_EXTEND_QUERY_PATH
value: "/etc/postgres_exporter/config.yaml"
volumeMounts:
- name: config
mountPath: /etc/postgres_exporter
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 200m
memory: 256Mi
volumes:
- name: config
configMap:
name: prometheus-postgresql-exporter
第五步:配置基于自定义指标的HPA
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: postgresql-custom-hpa
namespace: postgresql-cluster
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: StatefulSet
name: postgresql-slaves
minReplicas: 2
maxReplicas: 15
metrics:
- type: External
external:
metric:
name: postgresql_connections_active
target:
type: AverageValue
averageValue: "80"
- type: External
external:
metric:
name: postgresql_replication_lag_seconds
target:
type: AverageValue
averageValue: "5"
behavior:
scaleUp:
stabilizationWindowSeconds: 120
policies:
- type: Percent
value: 100
periodSeconds: 60
scaleDown:
stabilizationWindowSeconds: 600
policies:
- type: Pods
value: 1
periodSeconds: 300
第六步:自动扩缩容验证脚本
#!/bin/bash
# 扩缩容测试脚本
echo "开始扩缩容测试..."
# 获取当前副本数
CURRENT_REPLICAS=$(kubectl get statefulset postgresql-slaves -n postgresql-cluster -o jsonpath='{.spec.replicas}')
echo "当前从节点副本数: $CURRENT_REPLICAS"
# 模拟高负载
echo "启动负载测试..."
kubectl run load-test --image=postgres:14 -n postgresql-cluster --rm -it --restart=Never -- \
bash -c "
for i in {1..1000}; do
psql -h pgpool -U admin -d production -c 'SELECT count(*) FROM pg_stat_activity;' &
done
wait
"
# 等待HPA生效
echo "等待HPA自动扩容..."
sleep 120
# 检查扩容结果
NEW_REPLICAS=$(kubectl get statefulset postgresql-slaves -n postgresql-cluster -o jsonpath='{.spec.replicas}')
echo "扩容后从节点副本数: $NEW_REPLICAS"
# 检查HPA状态
kubectl get hpa postgresql-custom-hpa -n postgresql-cluster
echo "扩缩容测试完成"
生产环境实践
部署清单示例
主节点部署
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: postgresql-master
spec:
replicas: 1
template:
spec:
containers:
- name: postgresql
image: postgres:14
env:
- name: POSTGRES_DB
value: "production"
- name: POSTGRES_USER
value: "admin"
- name: POSTGRES_REPLICATION_USER
value: "replicator"
resources:
requests:
cpu: 2000m
memory: 4Gi
limits:
cpu: 4000m
memory: 8Gi
volumeMounts:
- name: data
mountPath: /var/lib/postgresql/data
volumeClaimTemplates:
- metadata:
name: data
spec:
accessModes: ["ReadWriteOnce"]
resources:
requests:
storage: 100Gi
性能调优要点
连接池优化
- max_connections: 200-500(根据硬件配置)
- shared_buffers: 物理内存的25%
- effective_cache_size: 物理内存的75%
日志配置
- log_statement: ‘mod’(记录修改操作)
- log_duration: on(记录查询耗时)
- log_slow_queries: 开启慢查询日志
监控告警配置
关键指标监控
- 可用性:主从复制延迟、连接成功率
- 性能:QPS、平均响应时间、慢查询数量
- 资源:CPU、内存、磁盘IO、网络IO
告警规则示例
groups:
- name: postgresql.rules
rules:
- alert: PostgreSQLDown
expr: pg_up == 0
for: 1m
labels:
severity: critical
annotations:
summary: "PostgreSQL实例下线"
- alert: PostgreSQLReplicationLag
expr: pg_replication_lag_seconds > 10
for: 2m
labels:
severity: warning
annotations:
summary: "主从复制延迟过高"
完整监控体系部署
第一步:部署Prometheus
apiVersion: apps/v1
kind: Deployment
metadata:
name: prometheus
namespace: postgresql-cluster
spec:
replicas: 1
selector:
matchLabels:
app: prometheus
template:
metadata:
labels:
app: prometheus
spec:
containers:
- name: prometheus
image: prom/prometheus:v2.40.0
ports:
- containerPort: 9090
volumeMounts:
- name: config
mountPath: /etc/prometheus
- name: data
mountPath: /prometheus
command:
- /bin/prometheus
- --config.file=/etc/prometheus/prometheus.yml
- --storage.tsdb.path=/prometheus
- --web.console.libraries=/etc/prometheus/console_libraries
- --web.console.templates=/etc/prometheus/consoles
- --storage.tsdb.retention.time=15d
- --web.enable-lifecycle
resources:
requests:
cpu: 500m
memory: 1Gi
limits:
cpu: 1000m
memory: 2Gi
volumes:
- name: config
configMap:
name: prometheus-config
- name: data
persistentVolumeClaim:
claimName: prometheus-data
---
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-config
namespace: postgresql-cluster
data:
prometheus.yml: |
global:
scrape_interval: 15s
evaluation_interval: 15s
rule_files:
- "/etc/prometheus/postgresql.rules"
alerting:
alertmanagers:
- static_configs:
- targets:
- alertmanager:9093
scrape_configs:
- job_name: 'postgresql'
static_configs:
- targets: ['postgresql-exporter:9187']
scrape_interval: 10s
metrics_path: /metrics
- job_name: 'pgpool'
static_configs:
- targets: ['pgpool:9999']
scrape_interval: 15s
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
namespaces:
names:
- postgresql-cluster
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
postgresql.rules: |
groups:
- name: postgresql.rules
rules:
- alert: PostgreSQLDown
expr: pg_up == 0
for: 1m
labels:
severity: critical
annotations:
summary: "PostgreSQL实例 {{ $labels.instance }} 下线"
description: "PostgreSQL实例已经下线超过1分钟"
- alert: PostgreSQLReplicationLag
expr: pg_replication_lag_seconds > 10
for: 2m
labels:
severity: warning
annotations:
summary: "PostgreSQL主从复制延迟过高"
description: "复制延迟 {{ $value }} 秒,超过10秒阈值"
- alert: PostgreSQLHighConnections
expr: pg_stat_activity_count > 150
for: 5m
labels:
severity: warning
annotations:
summary: "PostgreSQL连接数过高"
description: "当前连接数 {{ $value }},接近最大连接数限制"
- alert: PostgreSQLSlowQueries
expr: rate(pg_stat_statements_mean_time_seconds[5m]) > 1
for: 3m
labels:
severity: warning
annotations:
summary: "PostgreSQL存在慢查询"
description: "平均查询时间 {{ $value }} 秒,超过1秒阈值"
第二步:部署Grafana仪表板
apiVersion: apps/v1
kind: Deployment
metadata:
name: grafana
namespace: postgresql-cluster
spec:
replicas: 1
selector:
matchLabels:
app: grafana
template:
metadata:
labels:
app: grafana
spec:
containers:
- name: grafana
image: grafana/grafana:9.3.0
ports:
- containerPort: 3000
env:
- name: GF_SECURITY_ADMIN_PASSWORD
value: "admin123"
- name: GF_INSTALL_PLUGINS
value: "grafana-piechart-panel"
volumeMounts:
- name: grafana-storage
mountPath: /var/lib/grafana
- name: grafana-config
mountPath: /etc/grafana/provisioning
resources:
requests:
cpu: 200m
memory: 512Mi
limits:
cpu: 500m
memory: 1Gi
volumes:
- name: grafana-storage
persistentVolumeClaim:
claimName: grafana-data
- name: grafana-config
configMap:
name: grafana-config
---
apiVersion: v1
kind: ConfigMap
metadata:
name: grafana-config
namespace: postgresql-cluster
data:
datasources.yml: |
apiVersion: 1
datasources:
- name: Prometheus
type: prometheus
access: proxy
url: http://prometheus:9090
isDefault: true
dashboards.yml: |
apiVersion: 1
providers:
- name: 'default'
orgId: 1
folder: ''
type: file
disableDeletion: false
editable: true
options:
path: /var/lib/grafana/dashboards
第三步:故障排查工具脚本
#!/bin/bash
# PostgreSQL集群故障排查脚本
echo "=== PostgreSQL集群健康检查 ==="
# 检查Pod状态
echo "1. 检查Pod状态:"
kubectl get pods -n postgresql-cluster
# 检查服务状态
echo -e "\n2. 检查服务状态:"
kubectl get svc -n postgresql-cluster
# 检查PVC状态
echo -e "\n3. 检查存储状态:"
kubectl get pvc -n postgresql-cluster
# 检查主从复制状态
echo -e "\n4. 检查主从复制状态:"
kubectl exec postgresql-master-0 -n postgresql-cluster -- \
psql -U admin -d production -c "SELECT * FROM pg_stat_replication;"
# 检查从节点延迟
echo -e "\n5. 检查从节点延迟:"
kubectl exec postgresql-slaves-0 -n postgresql-cluster -- \
psql -U admin -d production -c "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS lag;"
# 检查连接数
echo -e "\n6. 检查当前连接数:"
kubectl exec postgresql-master-0 -n postgresql-cluster -- \
psql -U admin -d production -c "SELECT count(*) FROM pg_stat_activity;"
# 检查PgPool状态
echo -e "\n7. 检查PgPool节点状态:"
kubectl exec deployment/pgpool -n postgresql-cluster -- \
psql -h localhost -p 9999 -U admin -c "SHOW POOL_NODES;"
# 检查慢查询
echo -e "\n8. 检查慢查询(TOP 5):"
kubectl exec postgresql-master-0 -n postgresql-cluster -- \
psql -U admin -d production -c "
SELECT query, calls, total_time, mean_time, rows
FROM pg_stat_statements
ORDER BY mean_time DESC
LIMIT 5;"
# 检查表空间使用情况
echo -e "\n9. 检查数据库大小:"
kubectl exec postgresql-master-0 -n postgresql-cluster -- \
psql -U admin -d production -c "
SELECT datname, pg_size_pretty(pg_database_size(datname)) as size
FROM pg_database
WHERE datistemplate = false;"
echo -e "\n=== 健康检查完成 ==="
第四步:备份恢复脚本
#!/bin/bash
# PostgreSQL自动备份脚本
NAMESPACE="postgresql-cluster"
BACKUP_DIR="/backup/postgresql"
DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_NAME="postgresql_backup_$DATE"
echo "开始备份PostgreSQL数据库..."
# 创建备份目录
mkdir -p $BACKUP_DIR
# 执行逻辑备份
kubectl exec postgresql-master-0 -n $NAMESPACE -- \
pg_dumpall -U admin > $BACKUP_DIR/$BACKUP_NAME.sql
# 压缩备份文件
gzip $BACKUP_DIR/$BACKUP_NAME.sql
# 上传到对象存储(示例使用AWS S3)
aws s3 cp $BACKUP_DIR/$BACKUP_NAME.sql.gz s3://your-backup-bucket/postgresql/
# 清理本地文件
rm $BACKUP_DIR/$BACKUP_NAME.sql.gz
# 保留最近30天的备份
find $BACKUP_DIR -name "*.sql.gz" -mtime +30 -delete
echo "备份完成: $BACKUP_NAME.sql.gz"
# 验证备份完整性
echo "验证备份完整性..."
gunzip -t $BACKUP_DIR/$BACKUP_NAME.sql.gz
if [ $? -eq 0 ]; then
echo "备份文件完整性验证通过"
else
echo "备份文件损坏,请检查!"
exit 1
fi
第五步:性能优化配置
# PostgreSQL性能调优ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
name: postgresql-performance-config
namespace: postgresql-cluster
data:
postgresql-performance.conf: |
# 内存配置
shared_buffers = 256MB # 共享缓冲区,通常设为内存的25%
effective_cache_size = 1GB # 系统缓存大小,通常设为内存的75%
work_mem = 4MB # 单个操作的工作内存
maintenance_work_mem = 64MB # 维护操作的工作内存
# 检查点配置
checkpoint_completion_target = 0.9 # 检查点完成目标
wal_buffers = 16MB # WAL缓冲区大小
# 查询规划器配置
random_page_cost = 1.1 # 随机页面访问成本
effective_io_concurrency = 200 # 并发IO数
# 日志配置
log_min_duration_statement = 1000 # 记录超过1秒的查询
log_checkpoints = on # 记录检查点
log_connections = on # 记录连接
log_disconnections = on # 记录断开连接
log_lock_waits = on # 记录锁等待
# 统计信息
track_io_timing = on # 跟踪IO时间
track_functions = pl # 跟踪函数调用
# 连接配置
max_connections = 200 # 最大连接数
superuser_reserved_connections = 3 # 超级用户保留连接数
# 自动清理配置
autovacuum = on # 开启自动清理
autovacuum_max_workers = 3 # 自动清理工作进程数
autovacuum_naptime = 60s # 自动清理间隔
第六步:集群升级脚本
#!/bin/bash
# PostgreSQL集群滚动升级脚本
NAMESPACE="postgresql-cluster"
OLD_IMAGE="your-registry/postgresql:14-cluster"
NEW_IMAGE="your-registry/postgresql:15-cluster"
echo "开始PostgreSQL集群滚动升级..."
# 备份当前配置
kubectl get statefulset postgresql-master -n $NAMESPACE -o yaml > master-backup.yaml
kubectl get statefulset postgresql-slaves -n $NAMESPACE -o yaml > slaves-backup.yaml
# 首先升级从节点
echo "升级从节点..."
kubectl patch statefulset postgresql-slaves -n $NAMESPACE -p '{"spec":{"template":{"spec":{"containers":[{"name":"postgresql","image":"'$NEW_IMAGE'"}]}}}}'
# 等待从节点升级完成
kubectl rollout status statefulset/postgresql-slaves -n $NAMESPACE
# 验证从节点健康状态
echo "验证从节点状态..."
for i in {0..1}; do
kubectl exec postgresql-slaves-$i -n $NAMESPACE -- pg_isready
if [ $? -ne 0 ]; then
echo "从节点 postgresql-slaves-$i 升级失败"
exit 1
fi
done
# 升级主节点(需要短暂的服务中断)
echo "升级主节点..."
kubectl patch statefulset postgresql-master -n $NAMESPACE -p '{"spec":{"template":{"spec":{"containers":[{"name":"postgresql","image":"'$NEW_IMAGE'"}]}}}}'
# 等待主节点升级完成
kubectl rollout status statefulset/postgresql-master -n $NAMESPACE
# 验证主节点健康状态
kubectl exec postgresql-master-0 -n $NAMESPACE -- pg_isready
if [ $? -ne 0 ]; then
echo "主节点升级失败,开始回滚..."
kubectl apply -f master-backup.yaml
exit 1
fi
# 验证集群整体状态
echo "验证集群状态..."
kubectl exec postgresql-master-0 -n $NAMESPACE -- \
psql -U admin -d production -c "SELECT version();"
echo "PostgreSQL集群升级完成"
总结与展望
通过本文的技术方案,我们成功构建了一个高可用、可扩展、易维护的PostgreSQL容器化分布式系统。这套方案的核心优势:
🎯 核心收益
业务层面
- 可用性提升:从99.9%提升到99.99%
- 性能提升:读写分离后读性能提升3-5倍
- 运维效率:自动化运维,人力成本降低60%
技术层面
- 弹性扩展:根据业务负载自动调整资源
- 故障自愈:分钟级故障恢复,业务无感知
- 统一管控:云原生工具链,管理更简单
🔮 未来发展方向
随着云原生技术的不断发展,PostgreSQL容器化还有更多可能:
智能化运维
- AI驱动的性能调优
- 智能故障预测与预防
- 自适应资源分配
多云部署
- 跨云厂商部署
- 混合云数据同步
- 边缘计算场景支持
新技术融合
- Serverless数据库
- 向量数据库集成
- 图数据库扩展
📋 常见问题排查指南
问题1:从节点复制延迟过高
# 检查网络延迟
kubectl exec postgresql-slaves-0 -n postgresql-cluster -- \
ping postgresql-master
# 检查主节点WAL生成速度
kubectl exec postgresql-master-0 -n postgresql-cluster -- \
psql -U admin -d production -c "
SELECT pg_current_wal_lsn(),
pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0') as wal_bytes;"
# 调整同步模式
kubectl patch configmap postgresql-config -n postgresql-cluster -p '
{
"data": {
"postgresql.conf": "synchronous_commit = off\nwal_level = replica\n..."
}
}'
问题2:连接池耗尽
# 检查当前连接数
kubectl exec deployment/pgpool -n postgresql-cluster -- \
psql -h localhost -p 9999 -U admin -c "SHOW POOL_PROCESSES;"
# 调整连接池配置
kubectl patch configmap pgpool-config -n postgresql-cluster -p '
{
"data": {
"pgpool.conf": "num_init_children = 50\nmax_pool = 8\n..."
}
}'
# 重启PgPool应用配置
kubectl rollout restart deployment/pgpool -n postgresql-cluster
问题3:磁盘空间不足
# 检查磁盘使用情况
kubectl exec postgresql-master-0 -n postgresql-cluster -- df -h
# 扩展PVC容量
kubectl patch pvc data-postgresql-master-0 -n postgresql-cluster -p '
{
"spec": {
"resources": {
"requests": {
"storage": "200Gi"
}
}
}
}'
# 清理过期WAL文件
kubectl exec postgresql-master-0 -n postgresql-cluster -- \
psql -U admin -d production -c "SELECT pg_switch_wal();"
问题4:主节点故障转移
# 手动触发故障转移
kubectl exec deployment/pgpool -n postgresql-cluster -- \
pcp_promote_node -h localhost -p 9999 -U admin -n 1
# 验证新主节点状态
kubectl exec postgresql-slaves-0 -n postgresql-cluster -- \
psql -U admin -d production -c "SELECT pg_is_in_recovery();"
# 重建原主节点为从节点
kubectl delete pod postgresql-master-0 -n postgresql-cluster
🛠️ 生产环境最佳实践
资源配置建议
# 生产环境资源配置
resources:
requests:
cpu: 4000m # 4核CPU基础配置
memory: 8Gi # 8GB内存基础配置
limits:
cpu: 8000m # 8核CPU峰值配置
memory: 16Gi # 16GB内存峰值配置
# 存储配置
storage: 500Gi # 根据数据增长预期配置
storageClass: ssd # 使用SSD存储提升性能
安全加固配置
# 网络策略
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: postgresql-network-policy
namespace: postgresql-cluster
spec:
podSelector:
matchLabels:
app: postgresql
policyTypes:
- Ingress
- Egress
ingress:
- from:
- podSelector:
matchLabels:
app: pgpool
ports:
- protocol: TCP
port: 5432
egress:
- to: []
ports:
- protocol: TCP
port: 53
- protocol: UDP
port: 53
定期维护任务
#!/bin/bash
# 定期维护脚本(建议每周执行)
echo "=== PostgreSQL集群定期维护 ==="
# 1. 数据库统计信息更新
kubectl exec postgresql-master-0 -n postgresql-cluster -- \
psql -U admin -d production -c "ANALYZE;"
# 2. 重建索引(仅在必要时)
kubectl exec postgresql-master-0 -n postgresql-cluster -- \
psql -U admin -d production -c "REINDEX DATABASE production;"
# 3. 清理过期连接
kubectl exec deployment/pgpool -n postgresql-cluster -- \
pcp_proc_info -h localhost -p 9999 -U admin
# 4. 检查表膨胀
kubectl exec postgresql-master-0 -n postgresql-cluster -- \
psql -U admin -d production -c "
SELECT schemaname, tablename,
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) as size
FROM pg_tables
WHERE schemaname NOT IN ('information_schema', 'pg_catalog')
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
LIMIT 10;"
# 5. 备份验证
./backup-verify.sh
echo "=== 定期维护完成 ==="
📊 性能监控仪表板
核心指标面板
- 数据库连接数:实时连接数 vs 最大连接数
- 查询性能:QPS、平均响应时间、慢查询统计
- 复制延迟:主从延迟趋势图
- 资源使用:CPU、内存、磁盘IO使用率
- 错误率:连接失败率、查询错误率
告警阈值建议
# 推荐的告警阈值配置
alerts:
cpu_usage: 80% # CPU使用率超过80%
memory_usage: 85% # 内存使用率超过85%
disk_usage: 90% # 磁盘使用率超过90%
replication_lag: 5s # 复制延迟超过5秒
connections: 85% # 连接数超过最大值的85%
slow_queries: 10/min # 慢查询超过每分钟10个
写在最后
数据库的容器化之路并非一帆风顺,但正如PostgreSQL的标语"The world’s most advanced open source database"一样,通过不断的技术创新和实践积累,我们正在让数据库变得更加智能、可靠和高效。
希望这套技术方案能为你的项目带来价值,让我们一起在云原生的道路上,构建更加美好的数字世界!
实施建议
- 从小规模开始:先在测试环境验证方案可行性
- 逐步扩展:按照业务需求逐步增加节点和功能
- 持续监控:建立完善的监控体系,及时发现问题
- 定期演练:定期进行故障转移和恢复演练
- 文档维护:保持运维文档的及时更新
关于作者
如果你对PostgreSQL容器化技术有任何疑问或想法,欢迎在评论区交流讨论。让我们一起推动数据库技术的发展!
相关资源