Best Practices for Monitoring and Alerting on CoreDNS in a K8s Cluster

Published: 2025-07-24

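Background

CoreDNS is a critical component of a K8s cluster. It is mainly responsible for service discovery and domain name resolution inside the cluster. If resolution failures or resolution timeouts show up during operation, they deserve attention.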

Key CoreDNS metrics

First make sure Prometheus is already scraping the CoreDNS metrics successfully.
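One way to keep checking this continuously, rather than once by hand, is an alert that fires when the CoreDNS series disappear. The following is a minimal sketch of a Prometheus rule file; the group name, alert name, `for` duration and `severity` label are assumptions for illustration, not values from the original setup.

```yaml
groups:
  - name: coredns-scrape-check          # hypothetical group name
    rules:
      - alert: CoreDNSMetricsMissing    # hypothetical alert name
        # absent() returns 1 when no series with this metric name exists,
        # which usually means the scrape target is down or not configured.
        expr: absent(coredns_dns_requests_total)
        for: 10m                        # assumed; tune to your scrape interval
        labels:
          severity: warning             # assumed label value
        annotations:
          summary: "No coredns_dns_requests_total series found in Prometheus"
```

The Targets page of the Prometheus UI remains the quickest manual check; the rule only covers the case where scraping breaks later.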


  • CoreDNS request rate: sum(rate(coredns_dns_requests_total{}[5m])) by (proto,instance)

  • CoreDNS request rate (grouped by record type): sum(rate(coredns_dns_requests_total{}[5m])) by (type,instance)

  • CoreDNS request rate (requests with the DO flag set): sum(rate(coredns_dns_do_requests_total{}[5m])) by (instance)

  • CoreDNS UDP request packet size:
    P99: histogram_quantile(0.99,sum(rate(coredns_dns_request_size_bytes_bucket{proto="udp"}[5m])) by(le,proto,instance))
    P90: histogram_quantile(0.90,sum(rate(coredns_dns_request_size_bytes_bucket{proto="udp"}[5m])) by(le,proto,instance))
    P50: histogram_quantile(0.50,sum(rate(coredns_dns_request_size_bytes_bucket{proto="udp"}[5m])) by(le,proto,instance))

  • CoreDNS TCP request packet size:
    P99: histogram_quantile(0.99,sum(rate(coredns_dns_request_size_bytes_bucket{proto="tcp"}[5m])) by(le,proto,instance))
    P90: histogram_quantile(0.90,sum(rate(coredns_dns_request_size_bytes_bucket{proto="tcp"}[5m])) by(le,proto,instance))
    P50: histogram_quantile(0.50,sum(rate(coredns_dns_request_size_bytes_bucket{proto="tcp"}[5m])) by(le,proto,instance))

  • CoreDNS response rate (grouped by response code): sum(rate(coredns_dns_responses_total{}[5m])) by(rcode,instance)

  • CoreDNS response latency:
    P99: histogram_quantile(0.99,sum(rate(coredns_dns_request_duration_seconds_bucket{}[5m])) by(le,job,instance))
    P90: histogram_quantile(0.90,sum(rate(coredns_dns_request_duration_seconds_bucket{}[5m])) by(le,job,instance))
    P50: histogram_quantile(0.50,sum(rate(coredns_dns_request_duration_seconds_bucket{}[5m])) by(le,job,instance))

  • CoreDNS UDP response packet size:
    P99: histogram_quantile(0.99,sum(rate(coredns_dns_response_size_bytes_bucket{proto="udp"}[5m])) by(le,proto,instance))
    P90: histogram_quantile(0.90,sum(rate(coredns_dns_response_size_bytes_bucket{proto="udp"}[5m])) by(le,proto,instance))
    P50: histogram_quantile(0.50,sum(rate(coredns_dns_response_size_bytes_bucket{proto="udp"}[5m])) by(le,proto,instance))

  • CoreDNS TCP response packet size:
    P99: histogram_quantile(0.99,sum(rate(coredns_dns_response_size_bytes_bucket{proto="tcp"}[5m])) by(le,proto,instance))
    P90: histogram_quantile(0.90,sum(rate(coredns_dns_response_size_bytes_bucket{proto="tcp"}[5m])) by(le,proto,instance))
    P50: histogram_quantile(0.50,sum(rate(coredns_dns_response_size_bytes_bucket{proto="tcp"}[5m])) by(le,proto,instance))

  • Number of DNS records held in the CoreDNS cache: sum (coredns_cache_entries{}) by(type,instance)

  • CoreDNS cache hit rate (hits per second):
    sum (rate(coredns_cache_hits_total{}[5m])) by (type,instance)

  • CoreDNS cache miss rate (misses per second):
    sum (rate(coredns_cache_misses_total{}[5m])) by (type,instance)
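The two cache expressions above are per-second rates, not a percentage. If a hit ratio between 0 and 1 is easier to read on a dashboard or to alert on, it can be derived from the same two counters. The recording rule below is a minimal sketch: the group name and the recorded series name are assumptions, and it assumes both counters are exported by the CoreDNS version in use.

```yaml
groups:
  - name: coredns-derived                       # hypothetical group name
    rules:
      # Fraction of cache lookups answered from the cache, per instance.
      # hits / (hits + misses); yields NaN while there is no traffic.
      - record: instance:coredns_cache_hit_ratio:rate5m   # hypothetical series name
        expr: |
          sum(rate(coredns_cache_hits_total[5m])) by (instance)
          /
          (
            sum(rate(coredns_cache_hits_total[5m])) by (instance)
            +
            sum(rate(coredns_cache_misses_total[5m])) by (instance)
          )
```

Aggregating by `instance` only keeps the numerator and denominator on the same label set, so the division matches series one to one.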

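Among these, the metrics to watch most closely are P99 CoreDNS response latency, CoreDNS request rate, and CoreDNS cache hit rate. For P99 response latency, the DNS resolution timeout is typically 2s, so 1s is a reasonable initial alert threshold; refine it later into a more precise value based on the monitoring data actually observed.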

When creating the alert, choose a metric-based alerting rule; the rule can be configured with a PromQL statement.
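As an illustration of such a rule, the P99 latency expression from the metric list can be combined with the 1s starting threshold. This is a sketch in Prometheus rule-file form, not the configuration of any particular alerting product; the group name, alert name, `for` duration, `severity` label and annotation text are all assumptions.

```yaml
groups:
  - name: coredns-alerts                  # hypothetical group name
    rules:
      - alert: CoreDNSHighP99Latency      # hypothetical alert name
        # P99 response latency per instance, compared against the
        # initial 1s threshold (half of the typical 2s resolution timeout).
        expr: |
          histogram_quantile(
            0.99,
            sum(rate(coredns_dns_request_duration_seconds_bucket[5m])) by (le, job, instance)
          ) > 1
        for: 5m                           # assumed; avoids firing on short spikes
        labels:
          severity: warning               # assumed label value
        annotations:
          summary: "CoreDNS P99 response latency above 1s on {{ $labels.instance }}"
```

If the platform only accepts a bare PromQL statement, the `expr` body can be pasted on its own; the threshold `1` is the value to adjust once real latency data is available.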

 

