Spring Boot End-to-End Monitoring System Guide: Complete Details from Underlying Principles to Production-Grade Deployment, in Eight Core Modules

Published: 2025-07-12

1. Monitoring Architecture in Depth

1.1 Layers of a Modern Monitoring Stack

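  • Application layer: the Spring Boot services being monitored
  • Spring Boot Actuator: exposes the health and metrics endpoints
  • Micrometer: instruments the code and registers meters
  • Prometheus/InfluxDB: stores the time series
  • Grafana: dashboards and visualization
  • AlertManager: groups and routes alerts
  • WeCom/DingTalk: notification channels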

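1.2 How Metric Collection Works

  • Micrometer workflow:
// Conceptual sketch of the meter registration logic
public class MicrometerRegistry {
    void registerMeter(Meter meter) {
        // 1. Check the meter type (Counter/Gauge/Timer, etc.)
        // 2. Attach tags
        // 3. Publish to the Prometheus/InfluxDB registry implementation
    }
}
  • Prometheus scrape mechanism:
# How Prometheus scrapes and stores time series
1. Periodically sends an HTTP GET to /actuator/prometheus
2. Parses the metrics returned in the text exposition format
3. Appends the samples to the TSDB
4. Cuts the in-memory head into a persisted block about every 2 hours; older blocks are later compacted into larger ones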
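
Step 2 of the scrape flow can be made concrete with a small parser. The following is a minimal sketch using only the JDK; the class name `ExpositionLine` is hypothetical, and the parser deliberately ignores escaped quotes, commas inside label values, and special values such as `+Inf`, all of which the real format allows.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Parses one sample line of the Prometheus text exposition format (simplified). */
final class ExpositionLine {
    final String name;
    final Map<String, String> labels;
    final double value;

    private ExpositionLine(String name, Map<String, String> labels, double value) {
        this.name = name;
        this.labels = labels;
        this.value = value;
    }

    /** Returns null for blank lines and for "# HELP" / "# TYPE" comment lines. */
    static ExpositionLine parse(String line) {
        String s = line.trim();
        if (s.isEmpty() || s.startsWith("#")) {
            return null;
        }
        Map<String, String> labels = new LinkedHashMap<>();
        String name;
        String rest;
        int open = s.indexOf('{');
        if (open >= 0) {
            int close = s.lastIndexOf('}');
            name = s.substring(0, open);
            // Simplification: label values are assumed to contain no commas or escaped quotes.
            for (String pair : s.substring(open + 1, close).split(",")) {
                if (pair.trim().isEmpty()) {
                    continue;
                }
                int eq = pair.indexOf('=');
                String key = pair.substring(0, eq).trim();
                String quoted = pair.substring(eq + 1).trim();
                labels.put(key, quoted.substring(1, quoted.length() - 1)); // strip the quotes
            }
            rest = s.substring(close + 1).trim();
        } else {
            int space = s.indexOf(' ');
            name = s.substring(0, space);
            rest = s.substring(space + 1).trim();
        }
        // An optional timestamp may follow the value; only the value is kept.
        double value = Double.parseDouble(rest.split("\\s+")[0]);
        return new ExpositionLine(name, labels, value);
    }
}
```

Feeding it a line such as `jvm_memory_used_bytes{area="heap",id="G1 Eden Space"} 1.048576E7` yields the metric name, a two-entry label map, and the numeric sample that Prometheus would append to its TSDB.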

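2. Spring Boot Monitoring Configuration

2.1 Fine-Grained Control over Exposed Metrics

management:
  endpoints:
    web:
      exposure:
        include: health, prometheus  # expose only the endpoints that are needed
  metrics:
    export:
      prometheus:  # Spring Boot 2.x path; 3.x uses management.prometheus.metrics.export
        step: 1m  # aggregation step
        descriptions: true  # keep meter descriptions in the scrape output
    enable:
      jvm: true
      logback: false  # turn off meters that are not needed
    distribution:
      percentiles:
        http.server.requests: 0.5, 0.95, 0.99  # custom percentiles, keyed by meter name prefix
  endpoint:
    health:
      show-details: always
      probes:
        enabled: true  # exposes the liveness and readiness groups for K8s probes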
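
To make the percentile values in the configuration above concrete, here is a minimal sketch of a nearest-rank percentile over raw latency samples. The class name `PercentileSketch` is hypothetical, and the method is exact on a small array, whereas Micrometer computes its published percentiles approximately from a running distribution.

```java
import java.util.Arrays;

/** Exact nearest-rank percentile over a small sample array (illustration only). */
final class PercentileSketch {
    /** Returns the smallest sample such that at least a fraction q of the data is at or below it. */
    static double percentile(double[] samples, double q) {
        if (samples.length == 0 || q <= 0.0 || q > 1.0) {
            throw new IllegalArgumentException("need at least one sample and 0 < q <= 1");
        }
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(q * sorted.length); // 1-based rank
        return sorted[rank - 1];
    }
}
```

With 100 latency samples of 1 ms to 100 ms, the 0.5, 0.95, and 0.99 percentiles are the 50th, 95th, and 99th smallest samples. Percentiles computed inside each application instance cannot be averaged into a correct fleet-wide percentile, which is worth keeping in mind when choosing between published percentiles and histogram buckets.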

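2.2 Custom Metrics in Practice

import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.binder.MeterBinder;
import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;
import org.springframework.context.annotation.Bean;
import org.springframework.stereotype.Component;

// 1. Register a custom gauge (declare this method in a @Configuration class)
@Bean
public MeterBinder orderMetrics(OrderRepository repo) {
    return registry -> {
        // count() runs on every scrape, so it should be a cheap query
        Gauge.builder("order.count", repo, OrderRepository::count)
            // tag values must not be null, so fall back when REGION is unset
            .tag("region", System.getenv().getOrDefault("REGION", "unknown"))
            .register(registry);
    };
}

// 2. Time service-layer calls with an aspect
@Aspect
@Component
public class ServiceMonitor {
    private final Timer serviceTimer;

    // Use the injected registry so the timer is exported with the other meters
    public ServiceMonitor(MeterRegistry registry) {
        this.serviceTimer = Timer.builder("service.time")
            .publishPercentiles(0.95)
            .register(registry);
    }

    @Around("execution(* com..*Service.*(..))")
    public Object timeService(ProceedingJoinPoint pjp) throws Throwable {
        // proceed() throws Throwable, which the lambda forms of Timer.record cannot propagate
        Timer.Sample sample = Timer.start();
        try {
            return pjp.proceed();
        } finally {
            sample.stop(serviceTimer);
        }
    }
}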

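3. Advanced Prometheus Configuration

3.1 Storage Tuning

# In most Prometheus releases, retention and block duration are startup flags
# rather than prometheus.yml keys:
#   --storage.tsdb.retention.time=30d
#   --storage.tsdb.min-block-duration=2h  # block duration
# prometheus.yml key parameters
remote_write:
  - url: "http://thanos:10908/api/v1/receive"
    queue_config:
      capacity: 10000
      max_shards: 200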
3.2 Federated Cluster Deployment

Data center A: Prometheus-A -> Thanos Receiver
Data center B: Prometheus-B -> Thanos Receiver
Both Thanos Receivers -> Thanos Query -> Grafana

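4. Advanced Grafana Dashboard Development

4.1 JVM Memory Analysis Template

# Heap pressure: used / max per instance
sum(jvm_memory_used_bytes{area="heap"}) by (instance) /
sum(jvm_memory_max_bytes{area="heap"}) by (instance)

# Memory leak detection: heap usage grew by more than 100 MB within 1 hour
# (jvm_memory_used_bytes is a gauge, so delta() applies; rate() is meant for counters)
delta(jvm_memory_used_bytes{area="heap"}[1h]) > 100000000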
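
The arithmetic behind the heap pressure ratio and the growth check above can be sketched in plain Java. The class and method names are hypothetical, and the growth helper simply subtracts the first sample from the last, which only approximates what a query engine does when it extrapolates to the window boundaries.

```java
/** Plain-Java version of the heap pressure ratio and the hourly growth check (illustration only). */
final class HeapCheck {
    /** Used bytes divided by max bytes, each summed across the heap pools of one instance. */
    static double usageRatio(long[] usedBytesPerPool, long[] maxBytesPerPool) {
        long used = 0;
        long max = 0;
        for (long u : usedBytesPerPool) {
            used += u;
        }
        for (long m : maxBytesPerPool) {
            max += m;
        }
        if (max <= 0) {
            throw new IllegalArgumentException("total max heap must be positive");
        }
        return (double) used / max;
    }

    /** Growth across a window of samples: last sample minus first sample. */
    static long growthBytes(long[] usedSamples) {
        if (usedSamples.length < 2) {
            throw new IllegalArgumentException("need at least two samples");
        }
        return usedSamples[usedSamples.length - 1] - usedSamples[0];
    }
}
```

A heap that moves from 100 MB to 250 MB within the window shows 150 MB of growth and would cross the 100 MB threshold, although a single window like this cannot distinguish a leak from a heap that has simply not been collected yet.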

4.2 Distributed Tracing Integration

# Additions to application.yml (Spring Boot 3.x property names)
management:
  tracing:
    sampling:
      probability: 0.1  # sample 10% of traces
  zipkin:
    tracing:
      endpoint: http://zipkin:9411/api/v2/spans
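
What a 10% sampling probability means in practice can be shown with a deterministic sampler keyed on the trace ID. This is an illustration of the idea rather than the tracer's actual code; the class name `TraceSampler` and the 10,000-bucket resolution are assumptions made for the sketch.

```java
/** Sketch of a deterministic probability sampler keyed on the trace ID. */
final class TraceSampler {
    private static final long BUCKETS = 10_000L;
    private final long boundary;

    TraceSampler(double probability) {
        if (probability < 0.0 || probability > 1.0) {
            throw new IllegalArgumentException("probability must be within [0, 1]");
        }
        // Number of buckets (out of BUCKETS) whose traces are kept.
        this.boundary = Math.round(probability * BUCKETS);
    }

    /** The same trace ID always gets the same decision, so a trace is kept or dropped as a whole. */
    boolean isSampled(long traceId) {
        return Math.floorMod(traceId, BUCKETS) < boundary;
    }
}
```

Because the decision depends only on the trace ID, every service that sees the same ID reaches the same verdict, which avoids traces with missing spans. With randomly generated trace IDs, about one trace in ten is kept at a probability of 0.1.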

5. Alerting System Design

5.1 Multi-Level Alert Rules

# alert.rules.yml
groups:
- name: critical
  rules:
  - alert: HighErrorRate
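    # 5xx responses per second, taken from Micrometer's http.server.requests timer
    expr: sum by (instance) (rate(http_server_requests_seconds_count{status=~"5.."}[5m])) > 10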
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: "High error rate on {{ $labels.instance }}"
      description: "Error rate is {{ $value }}"

- name: warning
  rules:
  - alert: MemoryLeakWarning
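    # Restrict to heap and sum per instance: pools without a limit report max = -1 and would always fire
    expr: sum by (instance) (predict_linear(jvm_memory_used_bytes{area="heap"}[6h], 86400)) > sum by (instance) (jvm_memory_max_bytes{area="heap"})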
    labels:
      severity: warning
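
The MemoryLeakWarning rule relies on `predict_linear`, which fits a straight line through the samples in the range and extrapolates it forward. The following is a minimal least-squares sketch of that idea in plain Java; the class name is hypothetical, and as a simplification it extrapolates from the last sample's timestamp rather than from the rule evaluation time.

```java
/** Least-squares line fit followed by extrapolation (illustration of the predict_linear idea). */
final class LinearPrediction {
    /** Predicts the value secondsAhead after the last sample, given timestamps in seconds. */
    static double predict(double[] timesSec, double[] values, double secondsAhead) {
        int n = timesSec.length;
        if (n < 2 || n != values.length) {
            throw new IllegalArgumentException("need two or more samples with matching lengths");
        }
        double meanT = 0.0;
        double meanV = 0.0;
        for (int i = 0; i < n; i++) {
            meanT += timesSec[i];
            meanV += values[i];
        }
        meanT /= n;
        meanV /= n;

        double covariance = 0.0;
        double variance = 0.0;
        for (int i = 0; i < n; i++) {
            double dt = timesSec[i] - meanT;
            covariance += dt * (values[i] - meanV);
            variance += dt * dt;
        }
        if (variance == 0.0) {
            throw new IllegalArgumentException("timestamps must not all be identical");
        }
        double slope = covariance / variance;        // units of value per second
        double intercept = meanV - slope * meanT;
        return intercept + slope * (timesSec[n - 1] + secondsAhead);
    }
}
```

For samples of 100, 220, and 340 taken 60 seconds apart, the fitted slope is 2 per second, so the value predicted 60 seconds past the last sample is 460. The alert fires when the value predicted 86400 seconds ahead exceeds the configured maximum.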

5.2 Alert Routing Strategy

# alertmanager.yml
route:
  group_by: ['alertname']
  receiver: 'slack-notifications'
  routes:
  - match:
      severity: 'critical'
    receiver: 'sms-alert'
receivers:
- name: 'slack-notifications'
  slack_configs:
  - api_url: https://hooks.slack.com/services/XXX
- name: 'sms-alert'
  webhook_configs:
  - url: http://sms-gateway/api
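
The routing tree above resolves a receiver by walking the child routes in order and falling back to the root receiver when none matches. Here is a minimal sketch of that resolution logic; the class names are hypothetical, matching is limited to exact label equality, and nested routes and the `continue` option are left out.

```java
import java.util.List;
import java.util.Map;

/** Sketch of Alertmanager-style routing: the first matching child route wins. */
final class AlertRouter {
    /** A child route whose matchers are exact label equality checks. */
    static final class Route {
        final Map<String, String> match;
        final String receiver;

        Route(Map<String, String> match, String receiver) {
            this.match = match;
            this.receiver = receiver;
        }

        /** True when every matcher label is present on the alert with the same value. */
        boolean matches(Map<String, String> alertLabels) {
            for (Map.Entry<String, String> m : match.entrySet()) {
                if (!m.getValue().equals(alertLabels.get(m.getKey()))) {
                    return false;
                }
            }
            return true;
        }
    }

    private final String defaultReceiver;
    private final List<Route> children;

    AlertRouter(String defaultReceiver, List<Route> children) {
        this.defaultReceiver = defaultReceiver;
        this.children = children;
    }

    /** Returns the receiver of the first matching child, or the root receiver otherwise. */
    String resolve(Map<String, String> alertLabels) {
        for (Route child : children) {
            if (child.matches(alertLabels)) {
                return child.receiver;
            }
        }
        return defaultReceiver;
    }
}
```

With the configuration above, an alert labeled `severity: critical` resolves to `sms-alert`, while a warning falls through to `slack-notifications`.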

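6. Performance Optimization in Practice

6.1 Reducing Metric Collection Load

// Drop the meters recorded for the actuator endpoints themselves.
// Note: a MeterFilter accepts or denies whole meters; Micrometer offers
// no per-sample probabilistic sampling through this API.
@Bean
public MeterFilter actuatorMeterFilter() {
    return MeterFilter.deny(id -> {
        String uri = id.getTag("uri");
        return uri != null && uri.startsWith("/actuator");
    });
}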

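6.2 High-Concurrency Tuning

# Netty-specific settings
# Note: these keys are not built-in Spring Boot properties; they are assumed to be bound by
# an application-defined @ConfigurationProperties class and applied to Reactor Netty in code.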
reactor:
  netty:
    resources:
      max-connections: 50000
      max-idle-time: 30s
    metrics:
      enabled: true
      binders: ["jvm", "reactor"]

7. Production Deployment Checklist

7.1 K8s Deployment Template

apiVersion: apps/v1
kind: Deployment
metadata:
  name: spring-boot-app
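spec:
  selector:
    matchLabels:
      app: spring-boot-app  # must match the pod template labels
  template:
    metadata:
      labels:
        app: spring-boot-app
    spec:
      containers:
      - name: app
        image: spring-boot-app:latest  # placeholder; substitute the real image reference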
        livenessProbe:
          httpGet:
            path: /actuator/health/liveness
            port: 8080
        readinessProbe:
          httpGet:
            path: /actuator/health/readiness
            port: 8080
        resources:
          limits:
            memory: 2Gi
          requests:
            memory: 1Gi

7.2 Resource Planning for Monitoring Components

Component     CPU      Memory  Storage
Prometheus    4 cores  16 GB   500 GB
Grafana       2 cores  4 GB    50 GB
AlertManager  1 core   2 GB    -
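
The Prometheus storage figure can be sanity-checked with the sizing rule of thumb from the Prometheus documentation: disk space is roughly retention time in seconds times ingested samples per second times bytes per sample, where a sample typically costs about 1 to 2 bytes. A minimal sketch follows; the class name and the 100,000 samples per second used in the usage note are assumptions chosen only for illustration.

```java
/** Rule-of-thumb disk sizing for a Prometheus TSDB (illustration only). */
final class TsdbSizing {
    private static final double SECONDS_PER_DAY = 86_400.0;

    /** Estimated bytes needed for the given retention and ingestion rate. */
    static double requiredBytes(long retentionDays, double samplesPerSecond, double bytesPerSample) {
        return retentionDays * SECONDS_PER_DAY * samplesPerSecond * bytesPerSample;
    }

    /** Highest sustained ingestion rate that fits into the given disk budget. */
    static double maxSamplesPerSecond(double diskBytes, long retentionDays, double bytesPerSample) {
        return diskBytes / (retentionDays * SECONDS_PER_DAY * bytesPerSample);
    }
}
```

Under these assumptions, 30 days of retention at 100,000 samples per second and 2 bytes per sample needs about 518 GB, so a 500 GB volume covers roughly 96,000 samples per second at that retention. The estimate excludes the write-ahead log and any headroom for compaction.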

8. End-to-End Monitoring Case Studies

8.1 E-Commerce Sales Promotion Monitoring

-- Real-time dashboard metrics
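-- InfluxQL form, assuming a measurement named "metrics" with a "status" tag
-- Orders per minute, latest 5 buckets
SELECT sum("order_count") AS recent_orders
FROM "metrics"
WHERE time > now() - 1h
GROUP BY time(1m) ORDER BY time DESC LIMIT 5;
-- Mean payment latency per minute, paid orders only
SELECT mean("payment_latency") AS avg_pay_time
FROM "metrics"
WHERE time > now() - 1h AND "status" = 'paid'
GROUP BY time(1m);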

8.2 Financial Transaction Monitoring

// Audit metrics for fund operations
@Transactional
public void transfer(Account from, Account to, BigDecimal amount) {
    Metrics.counter("transfer.count", 
        "currency", from.getCurrency())
        .increment();
    
    Timer.Sample sample = Timer.start();
    try {
        // business logic
    } finally {
        sample.stop(Metrics.timer("transfer.time"));
    }
}
