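The Micrometer + Prometheus + Grafana combination is one of the most popular monitoring solutions in the Java ecosystem today, and it is particularly well suited to monitoring microservice applications in cloud-native environments. Below I walk through it from technical implementation to best practices.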
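I. Technology Stack Components and How They Work Together
1. Component Responsibilities
Component | Role | Key capabilities |
---|---|---|
Micrometer | Application metrics facade | Unified metrics collection API that can feed multiple monitoring systems |
Prometheus | Time-series database + scraper | Metric storage, querying, and alert rule evaluation |
Grafana | Visualization platform | Dashboards and visual analysis of data |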
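2. Data Flow
Application code records metrics through Micrometer, Spring Boot Actuator exposes them at /actuator/prometheus, Prometheus scrapes and stores them at a fixed interval, and Grafana queries Prometheus with PromQL to render dashboards.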
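What travels between the application and Prometheus is plain text, one sample per line. The following is a minimal sketch of how a single sample line is laid out, assuming the Prometheus text exposition format; `ExpositionLine` and `render` are hypothetical names used only for illustration and are not part of Micrometer or the Prometheus client.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical helper for illustration; not part of Micrometer or the Prometheus client.
final class ExpositionLine {
    private ExpositionLine() {}

    // Renders one sample as: name{k1="v1",k2="v2"} value (labels sorted by key).
    static String render(String name, Map<String, String> labels, double value) {
        if (labels.isEmpty()) {
            return name + " " + value;
        }
        String body = new TreeMap<>(labels).entrySet().stream()
                .map(e -> e.getKey() + "=\"" + escape(e.getValue()) + "\"")
                .collect(Collectors.joining(","));
        return name + "{" + body + "} " + value;
    }

    // Label values escape backslash, double quote, and newline.
    private static String escape(String v) {
        return v.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n");
    }
}
```

For example, `ExpositionLine.render("orders_total", Map.of("type", "online"), 42.0)` yields `orders_total{type="online"} 42.0`, which is the kind of line Prometheus reads on every scrape.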
II. Micrometer Integration in Practice
1. Spring Boot Configuration
Maven dependencies:
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
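application.yml configuration:
management:
  endpoints:
    web:
      exposure:
        include: health,info,prometheus
  metrics:
    export:
      prometheus:
        enabled: true # Spring Boot 2.x key; in 3.x it is management.prometheus.metrics.export.enabled
    tags:
      application: ${spring.application.name} # add an application tag to every metric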
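2. Custom Metrics Example
Collecting business metrics:
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import java.util.concurrent.TimeUnit;
import org.springframework.stereotype.Service;

@Service
public class OrderService {
    private final Counter orderCounter;
    private final Timer orderProcessingTimer;

    public OrderService(MeterRegistry registry) {
        // Counter: exported to Prometheus as orders_total
        orderCounter = Counter.builder("orders.total")
                .description("Total number of orders")
                .tag("type", "online")
                .register(registry);
        // Timer: exported to Prometheus as orders_processing_time_seconds
        orderProcessingTimer = Timer.builder("orders.processing.time")
                .description("Order processing time")
                .publishPercentiles(0.5, 0.95) // client-side 50th and 95th percentiles
                .publishPercentileHistogram() // also export _bucket series for histogram_quantile()
                .register(registry);
    }

    // Option 1: manual timing. Use either option 1 or option 2 for a given call, not both,
    // otherwise every order is counted and timed twice.
    public void processOrderManually(Order order) {
        long start = System.nanoTime();
        try {
            // business logic...
            orderCounter.increment();
        } finally {
            orderProcessingTimer.record(System.nanoTime() - start, TimeUnit.NANOSECONDS);
        }
    }

    // Option 2: let the timer measure a lambda
    public void processOrder(Order order) {
        orderProcessingTimer.record(() -> {
            // business logic...
            orderCounter.increment();
        });
    }
}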
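The meter names registered in Java (`orders.total`, `orders.processing.time`) are not the names that appear in PromQL. The sketch below approximates the renaming that happens on export, assuming the usual rules: dots become underscores, counters end in `_total`, and timers end in `_seconds`. `PrometheusNames` is a hypothetical class written for this illustration; Micrometer's real naming convention handles more cases, such as appending a gauge's base unit.

```java
// Hypothetical, simplified approximation of how meter names are rewritten for Prometheus.
final class PrometheusNames {
    enum MeterType { COUNTER, TIMER, GAUGE }

    private PrometheusNames() {}

    static String toPrometheusName(String meterName, MeterType type) {
        // Replace every character that is not valid in a Prometheus metric name.
        String name = meterName.replaceAll("[^a-zA-Z0-9_:]", "_");
        switch (type) {
            case COUNTER:
                // Counters carry a _total suffix, added only when missing.
                return name.endsWith("_total") ? name : name + "_total";
            case TIMER:
                // Timers are exported in seconds.
                return name.endsWith("_seconds") ? name : name + "_seconds";
            default:
                // Gauges keep their name in this sketch (base units are ignored here).
                return name;
        }
    }
}
```

Under these assumptions, `orders.processing.time` becomes `orders_processing_time_seconds`, and a histogram-enabled timer additionally exposes `_bucket`, `_count`, and `_sum` series under that name.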
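III. Prometheus Configuration Tuning
1. Example Scrape Configuration
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 30s
scrape_configs:
  - job_name: 'spring-apps'
    metrics_path: '/actuator/prometheus'
    scrape_interval: 10s # scrape the applications more frequently than the global default
    static_configs:
      - targets: ['app1:8080', 'app2:8080']
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
      # __meta_* labels are set only by service discovery, so this rule has no effect
      # with static_configs; __meta_service_name stands in for whatever label your
      # discovery mechanism provides.
      - source_labels: [__meta_service_name]
        target_label: service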
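2. Key Tuning Parameters
Storage configuration:
# prometheus.yml: controls block storage behavior
storage:
  tsdb:
    out_of_order_time_window: 1h # window in which out-of-order samples are accepted
# Retention and query limits are command-line flags rather than prometheus.yml keys:
#   --storage.tsdb.retention.time=15d   # data retention period
#   --query.lookback-delta=5m
#   --query.max-concurrency=20          # cap on concurrently executing queries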
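Before settling on a retention period it helps to estimate the disk space it implies. The sketch below applies the rule of thumb from the Prometheus storage documentation (retention in seconds × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample). The class name, the series count, and the 2-byte figure used in the example are assumptions for illustration, not measurements.

```java
import java.time.Duration;

// Hypothetical sizing helper; the inputs are estimates you supply.
final class TsdbSizing {
    private TsdbSizing() {}

    // disk bytes ~= retention seconds * samples per second * bytes per sample
    static long estimateDiskBytes(Duration retention, long activeSeries,
                                  Duration scrapeInterval, double bytesPerSample) {
        // Each active series contributes one sample per scrape.
        double samplesPerSecond = (double) activeSeries / scrapeInterval.getSeconds();
        return Math.round(retention.getSeconds() * samplesPerSecond * bytesPerSample);
    }
}
```

With 1,000 active series scraped every 10 seconds, 15 days of retention, and 2 bytes per sample, the estimate is 259,200,000 bytes, about 260 MB. Real usage also depends on series churn and compaction, so treat the result as an order-of-magnitude guide.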
IV. Grafana Dashboard Design
1. Core Monitoring Dashboards
JVM monitoring panel configuration:
Panel 1: Heap Memory Usage
  Query: sum(jvm_memory_used_bytes{area="heap"}) by (instance) / sum(jvm_memory_max_bytes{area="heap"}) by (instance)
  Visualization: Time series with the percent (0.0-1.0) unit
Panel 2: GC Pause Time
  Query: rate(jvm_gc_pause_seconds_sum[1m])
  Visualization: Time series (a heatmap would require histogram bucket data)
Panel 3: Thread States
  Query: jvm_threads_states_threads{instance=~"$instance"}
  Visualization: Stacked bar chart
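The GC panel relies on `rate()`, which turns an ever-increasing counter into a per-second value. The following is a simplified sketch of that idea, assuming samples arrive oldest first and that a drop in value means the counter was reset (for example after an application restart). `SimpleRate` is a hypothetical name; Prometheus's actual `rate()` additionally extrapolates to the edges of the time window, which this sketch omits.

```java
// Hypothetical, simplified model of PromQL rate(); no window-boundary extrapolation.
final class SimpleRate {
    private SimpleRate() {}

    // timestampsSeconds and values are parallel arrays, oldest sample first.
    static double perSecond(double[] timestampsSeconds, double[] values) {
        if (timestampsSeconds.length != values.length || values.length < 2) {
            return Double.NaN; // at least two samples are needed
        }
        double increase = 0.0;
        for (int i = 1; i < values.length; i++) {
            double delta = values[i] - values[i - 1];
            // A negative delta is treated as a counter reset: count from zero again.
            increase += (delta >= 0) ? delta : values[i];
        }
        double elapsed = timestampsSeconds[timestampsSeconds.length - 1] - timestampsSeconds[0];
        return elapsed > 0 ? increase / elapsed : Double.NaN;
    }
}
```

For samples 100, 110, 120, 130 taken 10 seconds apart, the increase is 30 over 30 seconds, so the result is 1.0 per second.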
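2. Visualizing Business Metrics
Order dashboard (rate() returns a per-second value, and the P95 panel requires the timer to publish histogram buckets, for example via publishPercentileHistogram() in Micrometer):
{
  "panels": [
    {
      "title": "Order Rate (per second)",
      "targets": [{
        "expr": "rate(orders_total[1m])",
        "legendFormat": "{{instance}}"
      }],
      "type": "graph",
      "yaxes": [{"format": "ops"}]
    },
    {
      "title": "Processing Time (95%)",
      "targets": [{
        "expr": "histogram_quantile(0.95, sum(rate(orders_processing_time_seconds_bucket[1m])) by (le))",
        "legendFormat": "P95"
      }],
      "type": "stat",
      "unit": "s"
    }
  ]
}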
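The P95 panel depends on `histogram_quantile()`, which estimates a quantile from cumulative bucket counts rather than from raw observations. The sketch below shows the core of that estimate: find the bucket containing the target rank, then interpolate linearly inside it. It is an illustration that assumes positive bucket bounds ending in `+Inf`; `HistogramQuantile` is a hypothetical name, and Prometheus handles further edge cases.

```java
// Hypothetical sketch of the bucket interpolation behind PromQL histogram_quantile().
final class HistogramQuantile {
    private HistogramQuantile() {}

    // upperBounds: ascending "le" values, the last one +Infinity.
    // cumulativeCounts: observations <= the matching upper bound.
    static double quantile(double q, double[] upperBounds, double[] cumulativeCounts) {
        int n = upperBounds.length;
        if (n < 2 || n != cumulativeCounts.length || !Double.isInfinite(upperBounds[n - 1])) {
            return Double.NaN;
        }
        double total = cumulativeCounts[n - 1];
        if (total == 0 || Double.isNaN(q)) return Double.NaN;
        if (q < 0) return Double.NEGATIVE_INFINITY;
        if (q > 1) return Double.POSITIVE_INFINITY;

        double rank = q * total;
        int b = 0;
        while (b < n - 1 && cumulativeCounts[b] < rank) b++;
        // Rank falls in the +Inf bucket: the best answer is the highest finite bound.
        if (b == n - 1) return upperBounds[n - 2];

        double lower = (b == 0) ? 0.0 : upperBounds[b - 1];
        double below = (b == 0) ? 0.0 : cumulativeCounts[b - 1];
        double inBucket = cumulativeCounts[b] - below;
        if (inBucket == 0) return lower;
        // Linear interpolation assumes observations are spread evenly within the bucket.
        return lower + (upperBounds[b] - lower) * (rank - below) / inBucket;
    }
}
```

Because of the interpolation, accuracy is bounded by bucket width: a quantile that lands in a wide bucket, or in the `+Inf` bucket, is only a rough estimate.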
V. Production Best Practices
1. Metric Naming Conventions
Type | Suffix | Example |
---|---|---|
Counter | _total | http_requests_total |
Gauge | _current | queue_size_current |
Timer | _seconds | api_latency_seconds |
Distribution summary | _summary | response_size_summary |
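2. Principles for Using Labels
- Avoid high-cardinality labels, such as user IDs or order numbers
- Use consistent label names across the team (for example, pick one of env vs environment)
- Tag the important dimensions: region, az, service_version, and so on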
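The reason high-cardinality labels are dangerous is that each distinct combination of label values creates its own time series. The sketch below computes the worst-case series count for one metric as the product of distinct values per label. The class name and the numbers in the example are hypothetical.

```java
import java.util.Map;

// Hypothetical helper: upper bound on time series created by one metric.
final class LabelCardinality {
    private LabelCardinality() {}

    // Every combination of label values is a separate series, so the bound is a product.
    static long worstCaseSeries(Map<String, Integer> distinctValuesPerLabel) {
        long series = 1;
        for (int distinct : distinctValuesPerLabel.values()) {
            // multiplyExact throws on overflow instead of silently wrapping.
            series = Math.multiplyExact(series, (long) Math.max(distinct, 1));
        }
        return series;
    }
}
```

With 3 regions, 3 availability zones, and 5 service versions, the bound is 45 series. Adding a `user_id` label with 100,000 values raises it to 4,500,000 for that single metric.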
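3. Resource Optimization Tips
Micrometer configuration:
@Bean
MeterRegistryCustomizer<MeterRegistry> metricsCommonTags() {
    return registry -> registry.config()
            // Drop meters we do not need. Micrometer names use dots (jvm.classes.*),
            // not the underscores that appear on the Prometheus side.
            .meterFilter(MeterFilter.denyNameStartsWith("jvm.classes"))
            // Fall back to "unknown" so a missing AWS_REGION does not yield a null tag value.
            .commonTags("region", System.getenv().getOrDefault("AWS_REGION", "unknown"));
}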
Prometheus resource limits:
# Set resource limits when deploying in containers
resources:
  limits:
    memory: 8Gi
  requests:
    cpu: 2
    memory: 4Gi
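VI. Advanced Features
1. Custom Collector
import io.prometheus.client.Collector;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Written against the Prometheus simpleclient API (io.prometheus.client).
public class CustomMetricsCollector extends Collector {
    @Override
    public List<MetricFamilySamples> collect() {
        List<MetricFamilySamples> samples = new ArrayList<>();
        // Add a custom gauge with a single labelled sample
        samples.add(new MetricFamilySamples(
                "custom_metric",
                Type.GAUGE,
                "Custom metric description",
                Collections.singletonList(
                        new MetricFamilySamples.Sample(
                                "custom_metric",
                                List.of("label1"),
                                List.of("value1"),
                                getCurrentValue()))));
        return samples;
    }

    // Placeholder: replace with the value you want to expose.
    private double getCurrentValue() {
        return 0.0;
    }
}
// Register the collector. The no-argument register() uses the default registry;
// to appear at /actuator/prometheus, the collector likely needs to be registered
// on the CollectorRegistry that Spring Boot exposes instead.
new CustomMetricsCollector().register();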
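2. Example Alerting Rules
groups:
  - name: application-alerts
    rules:
      - alert: HighErrorRate
        # Spring Boot exposes HTTP request counts as http_server_requests_seconds_count,
        # with the response code in the status label.
        expr: sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m])) by (instance) / sum(rate(http_server_requests_seconds_count[5m])) by (instance) > 0.05
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.instance }}"
          description: "Error rate is {{ $value }}"
      - alert: GCTooLong
        # More than 10% of wall-clock time spent in GC pauses over the last hour
        expr: rate(jvm_gc_pause_seconds_sum[1h]) > 0.1
        labels:
          severity: warning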
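The `for: 10m` clause means the expression must stay true across every evaluation for ten minutes before the alert fires; until then the alert is pending. A rule without `for`, like GCTooLong, fires on the first evaluation where the expression is true. The sketch below models that state machine for a single alert. `AlertForClause` is a hypothetical class, not Prometheus code.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical model of how a "for" duration gates an alert.
final class AlertForClause {
    enum State { INACTIVE, PENDING, FIRING }

    private final Duration holdFor;
    private Instant activeSince; // null while the condition is false

    AlertForClause(Duration holdFor) {
        this.holdFor = holdFor;
    }

    // Call once per rule evaluation with the current time and the expression result.
    State evaluate(Instant now, boolean conditionTrue) {
        if (!conditionTrue) {
            activeSince = null; // any false evaluation resets the clock
            return State.INACTIVE;
        }
        if (activeSince == null) {
            activeSince = now;
        }
        boolean heldLongEnough = Duration.between(activeSince, now).compareTo(holdFor) >= 0;
        return heldLongEnough ? State.FIRING : State.PENDING;
    }
}
```

A longer `for` suppresses short spikes at the cost of slower notification, so the value is a trade-off between noise and detection time.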
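The strengths of this monitoring stack are:
- Cloud-native friendly: it fits Kubernetes environments well
- Low intrusiveness: Micrometer acts as an abstraction layer that reduces coupling between application code and the monitoring backend
- Efficient storage: Prometheus's TSDB achieves a high compression ratio
- Rich visualization: the Grafana community provides many ready-made dashboards
Suggested rollout path:
- Start with baseline monitoring (JVM and HTTP metrics)
- Add business metrics incrementally
- Finally, implement custom alerts and automated responses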