Preface
Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud. Since its introduction in 2012, many companies and organizations have adopted Prometheus, and the project has a very active developer and user community. It is now a standalone open-source project, maintained independently of any company. To emphasize this, and to clarify the project's governance structure, Prometheus joined the Cloud Native Computing Foundation in 2016 as the second hosted project, after Kubernetes.
Prometheus collects and stores its metrics as time series data, i.e. metric information is stored with the timestamp at which it was recorded, alongside optional key-value pairs called labels. Prometheus can be installed in the following ways:
- Package manager, e.g. yum or apt.
- The Operator installer, on Kubernetes.
- Binary files, downloadable from the official site.
- Container images, started directly with Docker.
This article focuses on deploying monitoring on Kubernetes with the Operator, and on writing startup services for the binary installation.
Deploying Prometheus with the Operator
The Operator is a tool written in Go that creates, configures, and manages Prometheus clusters through Kubernetes. With it, the prometheus server, alertmanager, grafana, node-exporter, and other components can be deployed in one batch. For details, see the GitHub project:
https://github.com/prometheus-operator/kube-prometheus
Downloading the Operator
After downloading, the image source addresses need to be changed:
# Download kube-prometheus with git
yum install -y git
git clone -b release-0.11 https://github.com/prometheus-operator/kube-prometheus.git
# List the files; the manifests directory holds the deployment manifests
cd kube-prometheus && ls
build.sh code-of-conduct.md developer-workspace example.jsonnet experimental go.sum jsonnetfile.json kubescape-exceptions.json LICENSE manifests RELEASE.md tests
CHANGELOG.md CONTRIBUTING.md docs examples go.mod jsonnet jsonnetfile.lock.json kustomization.yaml Makefile README.md scripts
# Check where the images come from
grep image: -r manifests/
image: quay.io...
image: k8s.gcr.io...
# Mostly quay.io and k8s.gcr.io; the latter is unreachable.
grep k8s.gcr.io -r manifests/
manifests/kubeStateMetrics-deployment.yaml: image: k8s.gcr.io/kube-state-metrics/kube-state-metrics:v2.5.0
manifests/prometheusAdapter-deployment.yaml: image: k8s.gcr.io/prometheus-adapter/prometheus-adapter:v0.9.1
# Pull the same-version images from Docker Hub
docker pull bitnami/kube-state-metrics:2.5.0
docker tag bitnami/kube-state-metrics:2.5.0 easzlab.io.local:5000/prom/kube-state-metrics:2.5.0
docker push easzlab.io.local:5000/prom/kube-state-metrics:2.5.0
# Push to the private registry; likewise for prometheus-adapter
docker pull willdockerhub/prometheus-adapter:v0.9.1
docker tag willdockerhub/prometheus-adapter:v0.9.1 easzlab.io.local:5000/prom/prometheus-adapter:v0.9.1
docker push easzlab.io.local:5000/prom/prometheus-adapter:v0.9.1
# Rewrite the k8s.gcr.io registry to the private registry easzlab.io.local
sed -i 's/k8s.gcr.io\/kube-state-metrics\/kube-state-metrics:v2.5.0/easzlab.io.local:5000\/prom\/kube-state-metrics:2.5.0/g' manifests/kubeStateMetrics-deployment.yaml
sed -i 's/k8s.gcr.io\/prometheus-adapter\/prometheus-adapter:v0.9.1/easzlab.io.local:5000\/prom\/prometheus-adapter:v0.9.1/g' manifests/prometheusAdapter-deployment.yaml
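The two sed rewrites above can be sanity-checked on a throwaway file before touching the real manifests. This is a hypothetical demo (the file name is made up; the registry name is the one used above); note that using `#` as the sed delimiter avoids escaping every slash in the image path:

```shell
# Demonstrate the image rewrite on a temporary copy instead of the real manifest.
workdir=$(mktemp -d)
cat > "$workdir/demo.yaml" <<'EOF'
        image: k8s.gcr.io/kube-state-metrics/kube-state-metrics:v2.5.0
EOF

# Same substitution as above, but with '#' as the delimiter so the
# slashes in the image path need no escaping.
sed -i 's#k8s.gcr.io/kube-state-metrics/kube-state-metrics:v2.5.0#easzlab.io.local:5000/prom/kube-state-metrics:2.5.0#g' "$workdir/demo.yaml"

# Verify no k8s.gcr.io reference remains.
if grep -q 'k8s.gcr.io' "$workdir/demo.yaml"; then echo "still dirty"; else echo "clean"; fi
```

The same `grep k8s.gcr.io -r manifests/` from earlier can be re-run after the real rewrites; it should return nothing.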
Installing all components
Once the Kubernetes environment is ready, all components can be installed in one go.
# Initialize: create the namespace, roles and permissions, etc.
kubectl apply --server-side -f manifests/setup
# Wait until the CustomResourceDefinitions are Established
kubectl wait \
--for condition=Established \
--all CustomResourceDefinition \
--namespace=monitoring
# Deploy all components
kubectl apply -f manifests/
# Check pod status
kubectl get pod -n monitoring
NAME READY STATUS RESTARTS AGE
alertmanager-main-0 1/2 Running 0 54s
alertmanager-main-1 1/2 Running 0 54s
alertmanager-main-2 1/2 Running 0 54s
blackbox-exporter-559db48fd-mhx52 3/3 Running 0 102s
grafana-546559f668-zqtk4 1/1 Running 0 100s
kube-state-metrics-54778c94c4-cxfm6 3/3 Running 0 100s
node-exporter-547s8 2/2 Running 0 100s
node-exporter-hwdfd 2/2 Running 0 100s
node-exporter-p4nkn 2/2 Running 0 100s
node-exporter-tv68m 2/2 Running 0 100s
node-exporter-ww9bw 2/2 Running 0 100s
node-exporter-x7ntl 2/2 Running 0 100s
prometheus-adapter-7dc75dc9dc-ck6wr 1/1 Running 0 99s
prometheus-adapter-7dc75dc9dc-gswtt 1/1 Running 0 98s
prometheus-k8s-0 2/2 Running 0 53s
prometheus-k8s-1 2/2 Running 0 53s
prometheus-operator-79c5847fd8-sgg4s 2/2 Running 0 98s
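When the pod list is long, a small filter over the `kubectl get pod` output makes partially-ready pods (READY like 1/2) stand out. This is a hypothetical helper, demonstrated here on captured output rather than a live cluster:

```shell
# Print pods whose READY column (e.g. 1/2) is not full.
not_ready() {
  awk 'NR > 1 { split($2, r, "/"); if (r[1] != r[2]) print $1 }'
}

# Demonstrated on captured output; against a live cluster, pipe
# `kubectl get pod -n monitoring` into not_ready instead.
not_ready <<'EOF'
NAME                      READY   STATUS    RESTARTS   AGE
alertmanager-main-0       1/2     Running   0          54s
grafana-546559f668-zqtk4  1/1     Running   0          100s
prometheus-k8s-0          2/2     Running   0          53s
EOF
# → alertmanager-main-0
```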
# Check service status
kubectl get svc -n monitoring
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
alertmanager-main ClusterIP 10.68.115.15 <none> 9093/TCP,8080/TCP 2m28s
alertmanager-operated ClusterIP None <none> 9093/TCP,9094/TCP,9094/UDP 100s
blackbox-exporter ClusterIP 10.68.10.254 <none> 9115/TCP,19115/TCP 2m28s
grafana ClusterIP 10.68.241.190 <none> 3000/TCP 2m26s
kube-state-metrics ClusterIP None <none> 8443/TCP,9443/TCP 2m26s
node-exporter ClusterIP None <none> 9100/TCP 2m26s
prometheus-adapter ClusterIP 10.68.171.128 <none> 443/TCP 2m25s
prometheus-k8s ClusterIP 10.68.228.30 <none> 9090/TCP,8080/TCP 2m25s
prometheus-operated ClusterIP None <none> 9090/TCP 99s
prometheus-operator ClusterIP None <none> 8443/TCP 2m24s
Accessing Prometheus from outside the cluster
The Operator's default configuration only allows access to Prometheus from inside the cluster; to reach it from outside, the Service manifest must be rewritten and the NetworkPolicy deleted.
Edit the prometheus-service file:
vim manifests/prometheus-service.yaml
...
spec:
# Add type NodePort
type: NodePort
ports:
- name: web
port: 9090
targetPort: web
# Pin the node port to 39090
nodePort: 39090
- name: reloader-web
port: 8080
targetPort: reloader-web
# nodePort left unset; assigned randomly
sudo kubectl apply -f manifests/prometheus-service.yaml
service/prometheus-k8s configured
sudo kubectl get svc -n monitoring
prometheus-k8s NodePort 10.68.114.125 <none> 9090:39090/TCP,8080:34076/TCP 14h
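Note that 39090 lies outside Kubernetes's default NodePort range of 30000-32767, so this only works because the apiserver here was presumably started with an extended --service-node-port-range (common in kubeasz-style setups). A tiny hypothetical helper to check a candidate port against a range:

```shell
# Check whether a port falls inside a NodePort range (default 30000-32767).
in_nodeport_range() {
  local port=$1 lo=${2:-30000} hi=${3:-32767}
  [ "$port" -ge "$lo" ] && [ "$port" -le "$hi" ]
}

in_nodeport_range 39090 && echo allowed || echo rejected              # default range → rejected
in_nodeport_range 39090 30000 42767 && echo allowed || echo rejected  # extended range → allowed
```

If the port is rejected at apply time with a "provided port is not in the valid range" style error, either pick a port inside the default range or extend the apiserver's range.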
Also, the Operator ships a NetworkPolicy for security, which blocks external access; the corresponding prometheus-networkPolicy must be deleted.
kubectl delete -f manifests/prometheus-networkPolicy.yaml
Once the policy is deleted, the Prometheus UI is reachable from a browser on the host at port 39090 of any cluster node.
Accessing Grafana from outside the cluster
The Operator's default configuration only allows access to Grafana from inside the cluster; to reach it from outside, the Service manifest must be rewritten and the NetworkPolicy deleted.
Edit the grafana-service file:
vim manifests/grafana-service.yaml
...
spec:
# Add type NodePort
type: NodePort
ports:
- name: http
port: 3000
targetPort: http
# Pin the node port to 33000
nodePort: 33000
sudo kubectl apply -f manifests/grafana-service.yaml
service/grafana configured
sudo kubectl get svc -n monitoring
grafana NodePort 10.68.206.132 <none> 3000:33000/TCP 21h
Also, the Operator ships a NetworkPolicy for security, which blocks external access; the corresponding grafana-networkPolicy must be deleted.
kubectl delete -f manifests/grafana-networkPolicy.yaml
Once the policy is deleted, the Grafana UI is reachable from a browser on the host at port 33000 of any cluster node.
Default username: admin, password: admin.
After loading the default dashboard template, the monitoring view appears:
Binary deployment of Prometheus
Installing Prometheus from the binary
Binary releases of each component are available from the official download page:
https://prometheus.io/download/
Download and extract the prometheus tarball:
wget https://github.com/prometheus/prometheus/releases/download/v2.39.1/prometheus-2.39.1.linux-amd64.tar.gz
tar zxvf prometheus-2.39.1.linux-amd64.tar.gz
mkdir /apps
mv prometheus-2.39.1.linux-amd64 prometheus
mv prometheus /apps
Check that the configuration file syntax is valid:
./promtool check config prometheus.yml
Checking prometheus.yml
SUCCESS: prometheus.yml is valid prometheus config file syntax
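For reference, the stock prometheus.yml shipped in the tarball is roughly the following minimal configuration, a self-scrape of the server on localhost:9090 (exact contents may differ slightly between releases):

```yaml
global:
  scrape_interval: 15s      # how often to scrape targets
  evaluation_interval: 15s  # how often to evaluate rules

scrape_configs:
  # Prometheus scraping its own /metrics endpoint
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
```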
Add prometheus.service so that Prometheus starts on boot:
vim /etc/systemd/system/prometheus.service
[Unit]
Description=The Prometheus monitoring system and time series database.
Documentation=https://prometheus.io
After=network.target
[Service]
WorkingDirectory=/apps/prometheus/
# --config.file: path to the configuration file
# --web.enable-lifecycle: enable the Lifecycle API
ExecStart=/apps/prometheus/prometheus \
  --config.file=/apps/prometheus/prometheus.yml \
  --web.enable-lifecycle
ExecReload=/bin/kill -HUP $MAINPID
Restart=on-failure
[Install]
WantedBy=multi-user.target
Start the prometheus service:
systemctl daemon-reload
systemctl enable --now prometheus.service
systemctl status prometheus.service
Disable the firewall here; in production, open only the ports the service needs:
systemctl disable --now firewalld
Hot-reload the configuration; this relies on --web.enable-lifecycle already being set in prometheus.service:
systemctl daemon-reload
systemctl restart prometheus.service
curl -X POST http://192.168.100.181:9090/-/reload
Installing Node Exporter from the binary
Install node_exporter on the Kubernetes cluster nodes to collect hardware metrics from the physical hosts.
Download the node_exporter tarball:
wget https://github.com/prometheus/node_exporter/releases/download/v1.4.0/node_exporter-1.4.0.linux-amd64.tar.gz
tar zxvf node_exporter-1.4.0.linux-amd64.tar.gz
mv node_exporter-1.4.0.linux-amd64 node_exporter
mv node_exporter /apps
Add the node-exporter.service unit file:
vim /etc/systemd/system/node-exporter.service
[Unit]
Description=Prometheus Node Exporter
After=network.target
[Service]
ExecStart=/apps/node_exporter/node_exporter
[Install]
WantedBy=multi-user.target
Changing the Node Exporter port
Port 9100 turns out to be taken by a kube-rbac-proxy process, so the port is changed to 9110.
# 9100 is already in use
netstat -lntp
tcp 0 0 192.168.100.164:9100 0.0.0.0:* LISTEN 2515/kube-rbac-prox
# Find the node_exporter flag that sets the port
./node_exporter --help | grep 9100
--web.listen-address=":9100"
# Edit the service file; note: the address must not be quoted.
ExecStart=/apps/node_exporter/node_exporter --web.listen-address=:9110
After editing node-exporter.service, restart the service:
systemctl daemon-reload
systemctl enable --now node-exporter.service
systemctl status node-exporter.service
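Choosing 9110 assumes that port is itself free. A hypothetical helper built on `ss` can find the first unoccupied TCP port at or above a base:

```shell
# Return the first TCP port >= $1 that nothing is listening on.
# Bound ports show up in column 4 of `ss -lnt` as addr:port.
first_free_port() {
  local p=$1
  while ss -lnt 2>/dev/null | awk '{print $4}' | grep -q ":$p\$"; do
    p=$((p + 1))
  done
  echo "$p"
}

port=$(first_free_port 9100)   # on the node above, skips occupied 9100
```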
Browse to the node's new port 9110:
Adding Node Exporter to Prometheus
Back in prometheus, add a new job with the node's address.
vim /apps/prometheus/prometheus.yml
# Append at the end of the file
- job_name: 'prometheus-node'
static_configs:
- targets: ['192.168.100.164:9110']
# Hot-reload prometheus
curl -X POST http://192.168.100.181:9090/-/reload
The new node shows up on Prometheus's Targets page:
Installing Grafana from the rpm
Download the rpm and install it:
wget https://dl.grafana.com/enterprise/release/grafana-enterprise-8.4.1-1.x86_64.rpm
sudo rpm -i --nodeps grafana-enterprise-8.4.1-1.x86_64.rpm
Edit the grafana configuration file:
vim /etc/grafana/grafana.ini
#################################### Server ####################################
[server]
# Protocol (http, https, h2, socket)
protocol = http
# The ip address to bind to, empty will bind to all interfaces
http_addr = 0.0.0.0
# The http port to use
http_port = 3000
Restart grafana-server and open the firewall port:
sudo /bin/systemctl enable --now grafana-server.service
firewall-cmd --permanent --add-port=3000/tcp
firewall-cmd --reload
Browse to port 3000 on the server; the default credentials are admin/admin:
Adding the data source
The data source is Prometheus on port 9090:
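As an alternative to clicking through the UI, Grafana can pick up the data source from a provisioning file at startup. A minimal sketch (the file path under /etc/grafana/provisioning/datasources/ is arbitrary; the URL assumes the Prometheus server from the binary install above):

```yaml
# e.g. /etc/grafana/provisioning/datasources/prometheus.yaml (hypothetical name)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                      # Grafana proxies the requests
    url: http://192.168.100.181:9090   # the Prometheus server set up above
    isDefault: true
```

Restart grafana-server after placing the file so the data source is loaded.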
Finding a dashboard template
The official Grafana dashboard catalog:
https://grafana.com/grafana/dashboards/
Grafana needs dashboards to visualize the data; for Node Exporter, dashboard 11074 works well:
Importing the template
Dashboards can be imported by template ID or from a JSON file.
Open the dashboard to see the data collected by Prometheus:
Troubleshooting: alertmanager-main fails to start
When deploying the stack with the Operator, only one of alertmanager-main's two containers came up, and the Pods eventually went into CrashLoopBackOff:
sudo kubectl get pod -n monitoring
NAME READY STATUS RESTARTS AGE
alertmanager-main-0 1/2 CrashLoopBackOff 2 (3m6s ago) 14h
alertmanager-main-1 1/2 CrashLoopBackOff 2 (3m5s ago) 14h
alertmanager-main-2 1/2 CrashLoopBackOff 2 (3m4s ago) 14h
The events show that the container cannot be reached on port 9093:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 5m19s default-scheduler Successfully assigned monitoring/alertmanager-main-0 to 192.168.100.164
Normal Pulling 5m18s kubelet Pulling image "quay.io/prometheus/alertmanager:v0.24.0"
Normal Pulled 5m12s kubelet Successfully pulled image "quay.io/prometheus/alertmanager:v0.24.0" in 5.813652964s
Normal Created 5m12s kubelet Created container alertmanager
Normal Started 5m12s kubelet Started container alertmanager
Normal Pulling 5m12s kubelet Pulling image "quay.io/prometheus-operator/prometheus-config-reloader:v0.57.0"
Normal Pulled 5m7s kubelet Successfully pulled image "quay.io/prometheus-operator/prometheus-config-reloader:v0.57.0" in 5.643906655s
Normal Created 5m7s kubelet Created container config-reloader
Normal Started 5m7s kubelet Started container config-reloader
Warning Unhealthy 4m29s (x4 over 4m59s) kubelet Liveness probe failed: Get "http://172.20.153.210:9093/-/healthy": dial tcp 172.20.153.210:9093: connect: connection refused
Warning Unhealthy 14s (x67 over 5m6s) kubelet Readiness probe failed: Get "http://172.20.153.210:9093/-/ready": dial tcp 172.20.153.210:9093: connect: connection refused
Plan A: exec into the Pod and inspect the program, configuration files, and permissions; nothing unusual was found.
Plan B: search online; only a handful of reports of the same problem exist, none with an answer.
alertmanager CrashLoopBackOff #2552
https://github.com/prometheus-operator/prometheus-operator/issues/2552
Plan C: check the state of the Kubernetes components; it turned out core-dns was not installed.
After deploying core-dns, alertmanager-main started normally.
sudo kubectl apply -f coredns.yaml
sudo kubectl get pod -A
kube-system coredns-f58cf8cd9-jgjc9 1/1 Running 3 (110m ago) 39h
kube-system coredns-f58cf8cd9-q8lvm 1/1 Running 3 (110m ago) 39h
Summary:
The alertmanager-main failure could not be explained directly from its error messages or from online answers.
Start instead from the completeness of the cluster's own components: core-dns, for example, is a system component and must be installed.