Comparison of K8s Metric Collection Solutions

Published: 2025-06-14

1. Background

Megacloud Portal needs to add monitoring for K8S. The current requirement is:

Obtain the CPU/Memory metrics of Nodes and Pods in K8S, and display the TopN after processing.

To achieve this function, the server works as follows:

  • Collect resource metrics of K8S Nodes and Pods
  • Process (ETL) and store the collected data
  • Implement an API for the front end to acquire the data
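The processing step in the list above ends in a TopN selection. The following is a minimal sketch of that selection, assuming the collected data has already been flattened into one record per Node or Pod. The `Usage` type and the `TopN` function are hypothetical names introduced for illustration only.

```go
package main

import (
	"fmt"
	"sort"
)

// Usage is a hypothetical flattened record for one Node or Pod.
type Usage struct {
	Name         string
	CPUNanoCores uint64
	MemoryBytes  uint64
}

// TopN returns the n records with the highest key value, in descending
// order. The input slice is copied first, so the caller's data is not
// reordered.
func TopN(items []Usage, n int, key func(Usage) uint64) []Usage {
	sorted := make([]Usage, len(items))
	copy(sorted, items)
	sort.SliceStable(sorted, func(i, j int) bool {
		return key(sorted[i]) > key(sorted[j])
	})
	if n < 0 {
		n = 0
	}
	if n > len(sorted) {
		n = len(sorted)
	}
	return sorted[:n]
}

func main() {
	// Illustrative values only.
	pods := []Usage{
		{Name: "pod-a", CPUNanoCores: 4_000_000, MemoryBytes: 10 << 20},
		{Name: "pod-b", CPUNanoCores: 14_000_000, MemoryBytes: 50 << 20},
		{Name: "pod-c", CPUNanoCores: 31_000_000, MemoryBytes: 293 << 20},
	}
	byCPU := func(u Usage) uint64 { return u.CPUNanoCores }
	for _, u := range TopN(pods, 2, byCPU) {
		fmt.Println(u.Name, u.CPUNanoCores)
	}
}
```

Passing the key as a function lets the same helper rank by CPU or by memory without duplicating the sort.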

The key is **the collection and processing of metric data**. The following is a brief introduction and comparison of related collection schemes.

2. Solutions

2.1 MetricBeat

MetricBeat is a metric collection tool provided by Elastic. It can collect metrics from many open source software projects, including Kubernetes, and can send the data to Elasticsearch, Kafka, Redis, or Logstash for processing or storage.

The following is the data format collected from K8S Node and Pod in MetricBeat.

  • Node Data Format (information other than CPU/Memory has been omitted)

{
    "@timestamp": "2017-04-06T15:29:27.150Z",
    "beat": {
        "hostname": "beathost",
        "name": "beathost",
        "version": "6.0.0-alpha1"
    },
    "kubernetes": {
        "node": {
            "cpu": {
                "usage": {
                    "core" : {
                        "ns": 7247863769557035
                    },
                    "nanocores": 1662117892
                }
            },
            "memory": {
                "available": {
                    "bytes": 134202847232
                },
                "majorpagefaults": 1044,
                "pagefaults": 83482928,
                "rss": {
                    "bytes": 178053120
                },
                "usage": {
                    "bytes": 67062091776
                },
                "workingset": {
                    "bytes": 51496206336
                }
            },
            "name": "localhost",
            "start_time": "2017-02-08T10:33:38Z"
        }
    },
    "metricset": {
        "host": "localhost:10255",
        "module": "kubernetes",
        "name": "node",
        "rtt": 650741
    },
    "type": "metricsets"
}

{

  "beat": {
    "hostname": "X1",
    "name": "X1",
    "version": "6.0.0-alpha1"
  },
  "kubernetes": {
    "node": {
      "cpu": {
        "allocatable": {
          "cores": 2
        },
        "capacity": {
          "cores": 2
        }
      },
      "memory": {
        "allocatable": {
          "bytes": 2097786880
        },
        "capacity": {
          "bytes": 2097786880
        }
      },
      "name": "minikube",
      "pod": {
        "allocatable": {
          "total": 110
        },
        "capacity": {
          "total": 110
        }
      },
      "status": {
        "ready": "true",
        "unschedulable": false
      }
    }
  },
  "metricset": {
    "host": "192.168.99.100:18080"
  }
}
  • Pod Data Format
{
    "@timestamp": "2017-04-06T15:29:27.150Z",
    "beat": {
        "hostname": "beathost",
        "name": "beathost",
        "version": "6.0.0-alpha1"
    },
    "kubernetes": {
        "namespace": "ns",
        "node": {
          "name": "localhost"
        },
        "pod": {
            "name": "nginx-3137573019-pcfzh",
            "uid": "b89a812e-18cd-11e9-b333-080027190d51",
            "network": {
                "rx": {
                    "bytes": 18999261,
                    "errors": 0
                },
                "tx": {
                    "bytes": 28580621,
                    "errors": 0
                }
            },
            "start_time": "2017-04-06T12:09:05Z"
        }
    },
    "metricset": {
        "host": "localhost:10255",
        "module": "kubernetes",
        "name": "pod",
        "rtt": 636230
    },
    "type": "metricsets"
}
  • Container Data Format
{
    "@timestamp": "2017-04-06T15:29:27.150Z",
    "beat": {
        "hostname": "beathost",
        "name": "beathost",
        "version": "6.0.0-alpha1"
    },
    "kubernetes": {
        "container": {
            "cpu": {
                "usage": {
                    "core": {
                        "ns": 3305756719
                    },
                    "nanocores": 5992
                }
            },
            
            "memory": {
                "available": {
                    "bytes": 0
                },
                "majorpagefaults": 47,
                "pagefaults": 2298,
                "rss": {
                    "bytes": 1441792
                },
                "usage": {
                    "bytes": 7643136
                },
                "workingset": {
                    "bytes": 1466368
                }
            },
            "name": "nginx"
        },
        "namespace": "ns",
        "node": {
          "name": "localhost"
        },
        "pod": {
            "name": "nginx-3137573019-pcfzh"
        }
    },
    "metricset": {
        "host": "localhost:10255",
        "module": "kubernetes",
        "name": "container",
        "rtt": 650739
    },
    "type": "metricsets"
}

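MetricBeat only provides CPU/Memory metric data for Nodes and containers, not for Pods. If we use MetricBeat for collection, we need to do the following:

  • Deploy MetricBeat on K8S. This may require a lot of manual operations.
  • We need to develop ETL tools to process the data, including aggregating container data into Pod-level CPU/Memory metrics.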
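As an illustration of the extra processing MetricBeat output would require, the following sketch flattens one container document of the shape shown above into a flat record. It assumes the events arrive as valid JSON. The `containerEvent`, `ContainerRecord` and `Flatten` names are hypothetical and are not part of MetricBeat.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// containerEvent mirrors only the fields of the MetricBeat container
// document that this sketch needs; all other fields are ignored.
type containerEvent struct {
	Timestamp  string `json:"@timestamp"`
	Kubernetes struct {
		Namespace string `json:"namespace"`
		Node      struct {
			Name string `json:"name"`
		} `json:"node"`
		Pod struct {
			Name string `json:"name"`
		} `json:"pod"`
		Container struct {
			Name string `json:"name"`
			CPU  struct {
				Usage struct {
					NanoCores uint64 `json:"nanocores"`
				} `json:"usage"`
			} `json:"cpu"`
			Memory struct {
				Usage struct {
					Bytes uint64 `json:"bytes"`
				} `json:"usage"`
			} `json:"memory"`
		} `json:"container"`
	} `json:"kubernetes"`
}

// ContainerRecord is a hypothetical flat row, convenient for storage
// and for later grouping by namespace and Pod.
type ContainerRecord struct {
	Timestamp, Namespace, Node, Pod, Container string
	CPUNanoCores, MemoryBytes                  uint64
}

// Flatten decodes one container event and copies the relevant fields.
func Flatten(raw []byte) (ContainerRecord, error) {
	var ev containerEvent
	if err := json.Unmarshal(raw, &ev); err != nil {
		return ContainerRecord{}, fmt.Errorf("decode container event: %w", err)
	}
	k := ev.Kubernetes
	return ContainerRecord{
		Timestamp:    ev.Timestamp,
		Namespace:    k.Namespace,
		Node:         k.Node.Name,
		Pod:          k.Pod.Name,
		Container:    k.Container.Name,
		CPUNanoCores: k.Container.CPU.Usage.NanoCores,
		MemoryBytes:  k.Container.Memory.Usage.Bytes,
	}, nil
}

// sampleEvent is a trimmed copy of the container document shown above.
const sampleEvent = `{
  "@timestamp": "2017-04-06T15:29:27.150Z",
  "kubernetes": {
    "container": {
      "cpu": {"usage": {"nanocores": 5992}},
      "memory": {"usage": {"bytes": 7643136}},
      "name": "nginx"
    },
    "namespace": "ns",
    "node": {"name": "localhost"},
    "pod": {"name": "nginx-3137573019-pcfzh"}
  }
}`

func main() {
	rec, err := Flatten([]byte(sampleEvent))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", rec)
}
```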

2.2 Telegraf

Telegraf is an open source software written in Go for metric collection. Like MetricBeat, it provides numerous plugins to collect data from multiple sources.

For Kubernetes, Telegraf provides a Kubernetes plugin to collect data. It gets data through Kubelet’s /stats/summary API. It can also be used together with the Prometheus plugin to collect more metric data.

The following are some of its metric data structures. Like MetricBeat, it does not provide Pod-level CPU/Memory statistics, so these have to be aggregated from container data.

type NodeMetrics struct {
	NodeName         string             `json:"nodeName"`
	SystemContainers []ContainerMetrics `json:"systemContainers"`
	StartTime        time.Time          `json:"startTime"`
	CPU              CPUMetrics         `json:"cpu"`
	Memory           MemoryMetrics      `json:"memory"`
	Network          NetworkMetrics     `json:"network"`
	FileSystem       FileSystemMetrics  `json:"fs"`
	Runtime          RuntimeMetrics     `json:"runtime"`
}

// PodMetrics contains metric data on a given pod
type PodMetrics struct {
	PodRef     PodReference       `json:"podRef"`
	StartTime  *time.Time         `json:"startTime"`
	Containers []ContainerMetrics `json:"containers"`
	Network    NetworkMetrics     `json:"network"`
	Volumes    []VolumeMetrics    `json:"volume"`
}

// ContainerMetrics represents the metric data collect about a container from the kubelet
type ContainerMetrics struct {
	Name      string            `json:"name"`
	StartTime time.Time         `json:"startTime"`
	CPU       CPUMetrics        `json:"cpu"`
	Memory    MemoryMetrics     `json:"memory"`
	RootFS    FileSystemMetrics `json:"rootfs"`
	LogsFS    FileSystemMetrics `json:"logs"`
}
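The container-to-Pod aggregation mentioned above can be sketched as follows. The struct definitions are reduced stand-ins for the types listed above. The fields of `CPUMetrics` and `MemoryMetrics` are not shown in this document, so the ones used here are assumptions based on the Kubelet response in the Appendix. `PodUsage` and `AggregatePod` are hypothetical names.

```go
package main

import "fmt"

// Reduced stand-ins for the types above. The CPUMetrics and
// MemoryMetrics fields are assumed from the Kubelet JSON in the Appendix.
type CPUMetrics struct {
	UsageNanoCores int64
}

type MemoryMetrics struct {
	UsageBytes      int64
	WorkingSetBytes int64
}

type ContainerMetrics struct {
	Name   string
	CPU    CPUMetrics
	Memory MemoryMetrics
}

type PodReference struct {
	Name      string
	Namespace string
}

type PodMetrics struct {
	PodRef     PodReference
	Containers []ContainerMetrics
}

// PodUsage is a hypothetical Pod-level result record.
type PodUsage struct {
	Namespace, Name       string
	CPUNanoCores          int64
	MemoryUsageBytes      int64
	MemoryWorkingSetBytes int64
}

// AggregatePod sums the CPU and memory usage of all containers in a Pod.
// The sum only approximates the Pod's real usage, because containers are
// sampled at slightly different times.
func AggregatePod(p PodMetrics) PodUsage {
	out := PodUsage{Namespace: p.PodRef.Namespace, Name: p.PodRef.Name}
	for _, c := range p.Containers {
		out.CPUNanoCores += c.CPU.UsageNanoCores
		out.MemoryUsageBytes += c.Memory.UsageBytes
		out.MemoryWorkingSetBytes += c.Memory.WorkingSetBytes
	}
	return out
}

func main() {
	// Illustrative values only.
	pod := PodMetrics{
		PodRef: PodReference{Name: "web", Namespace: "ns"},
		Containers: []ContainerMetrics{
			{Name: "nginx", CPU: CPUMetrics{UsageNanoCores: 5992}, Memory: MemoryMetrics{UsageBytes: 7643136, WorkingSetBytes: 1466368}},
			{Name: "sidecar", CPU: CPUMetrics{UsageNanoCores: 1000}, Memory: MemoryMetrics{UsageBytes: 1048576, WorkingSetBytes: 524288}},
		},
	}
	fmt.Printf("%+v\n", AggregatePod(pod))
}
```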

2.3 MetricServer

MetricServer also obtains metric data through the /stats/summary API provided by Kubelet. It keeps the data in memory and exposes it for external access through an API registered via the kube-aggregator mechanism.

The API provided by MetricServer has the fixed URL prefix
/apis/metrics.k8s.io/v1beta1/, which is combined with the following paths to access metric data. All of them support only the GET method:

  • /nodes: Get the metric data of all Nodes.
  • /nodes/(node): Get the metric data of the specified Node.
  • /namespaces/(namespace)/pods: Get the metric data of all Pods in a namespace.
  • /namespaces/(namespace)/pods/(pod): Get the metric data of the specified Pod.
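The four paths above can be built with a few small helpers. This is only a sketch: the function names are hypothetical, and the URL prefix is passed in as a parameter because it depends on the API version served by the cluster. The value used in `main` is an assumption.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
)

// NodesPath returns the path that lists metrics for all Nodes.
func NodesPath(prefix string) string {
	return path.Join(prefix, "nodes")
}

// NodePath returns the path for the metrics of a single Node.
func NodePath(prefix, node string) string {
	return path.Join(prefix, "nodes", url.PathEscape(node))
}

// PodsPath returns the path that lists metrics for all Pods in a namespace.
func PodsPath(prefix, namespace string) string {
	return path.Join(prefix, "namespaces", url.PathEscape(namespace), "pods")
}

// PodPath returns the path for the metrics of a single Pod.
func PodPath(prefix, namespace, pod string) string {
	return path.Join(prefix, "namespaces", url.PathEscape(namespace), "pods", url.PathEscape(pod))
}

func main() {
	// Assumed prefix; replace it with the one served by your cluster.
	const prefix = "/apis/metrics.k8s.io/v1beta1/"

	fmt.Println(NodesPath(prefix))
	fmt.Println(NodePath(prefix, "tk01"))
	fmt.Println(PodsPath(prefix, "kube-system"))
	fmt.Println(PodPath(prefix, "kube-system", "etcd-tk01"))
}
```

With kubectl configured, such a path can be queried directly with `kubectl get --raw <path>`, which returns the JSON response of the API.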

In addition, we can view the CPU/Memory metrics of Nodes and Pods with the kubectl top command in a terminal.

$ kubectl top nodes
NAME            CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
tk01            217m         10%    5296Mi          68%
vm-0-2-ubuntu   84m          4%     1189Mi          32%

$ kubectl top pods --all-namespaces
NAMESPACE     NAME                              CPU(cores)   MEMORY(bytes)
kube-system   coredns-f9fd979d6-jzv8q           4m           10Mi
kube-system   coredns-f9fd979d6-tx9m4           4m           10Mi
kube-system   etcd-tk01                         14m          50Mi
kube-system   kube-apiserver-tk01               31m          293Mi
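The output above is expressed in millicores (m) and mebibytes (Mi), while the raw Kubelet and MetricBeat samples in this document report nanocores and bytes. The sketch below shows one way to convert between them for display. The function names are hypothetical, and the rounding rules (CPU rounded up, memory truncated) are choices made for this sketch rather than a description of how kubectl itself rounds.

```go
package main

import "fmt"

const (
	nanoCoresPerMilliCore = 1_000_000
	bytesPerMi            = 1 << 20
)

// FormatCPU renders a nanocore value in millicores, rounding up so that
// a small non-zero usage is not displayed as 0m.
func FormatCPU(nanoCores uint64) string {
	milli := (nanoCores + nanoCoresPerMilliCore - 1) / nanoCoresPerMilliCore
	return fmt.Sprintf("%dm", milli)
}

// FormatMemory renders a byte count in whole mebibytes, truncating the
// fractional part.
func FormatMemory(bytes uint64) string {
	return fmt.Sprintf("%dMi", bytes/bytesPerMi)
}

func main() {
	// Node-level values taken from the Kubelet sample in the Appendix.
	fmt.Println(FormatCPU(1078216675))
	fmt.Println(FormatMemory(5632106496))
}
```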

K8S provides libraries to access the above APIs. We have now implemented the first version of metric-server-collector, which uses the MetricServer API to obtain the CPU/Memory metrics of Nodes and Pods and converts them into the data format we need.

2.4 Kubelet cAdvisor

Kubelet integrates cAdvisor to collect statistics on the CPU, memory, file system, and network usage of the containers on a node. It exposes this metric data externally through the /stats/summary API. The Telegraf and MetricServer schemes described above both obtain their metric data through this API. Therefore, we can access Kubelet’s API directly to obtain metric data instead of using those tools.

For this solution, the following changes need to be made based on the current metric-server-collector:

  • Rewrite the scraper code so that it accesses the Kubelet API instead of the MetricServer API to obtain metric data.
  • Convert the data returned by Kubelet to the required data format.

3. Comparison

Based on the above information, the development, operation and maintenance work of each solution compares as follows:

| Solution | Development task | Development complexity | Deployment operation | Deployment complexity | Others |
| --- | --- | --- | --- | --- | --- |
| MetricBeat | Develop ETL tools to process data | ★★★ | MetricBeat + ETL tools + data transmission & storage components | ★★★ | - |
| Telegraf | Develop Telegraf processor or related ETL tools to process data | ★★★ | Deploy Telegraf + ETL tools | ★★★ | - |
| MetricServer | No additional development required | - | MetricServer + collector | - | The data is stored in memory, which consumes resources when the amount of data is large |
| Kubelet cAdvisor | Re-implement the collector, expected one week | ★★ | Only need to deploy collector | - | - |

Based on the above comparison, the preliminary conclusions are as follows:

  • The data format of MetricBeat does not meet the requirements and needs considerable additional processing, so it is not considered.

  • Given that we only need to collect CPU/Memory metrics for Nodes and Pods in K8S, Telegraf is overkill and requires additional development, operation and maintenance work. Unless more K8S metrics need to be collected, it can be left out of consideration for now.

  • Compared with the above two schemes, deploying MetricServer on K8S is very simple, and combined with the metric-server-collector it meets the requirements.

  • The collector implementation based on the Kubelet API can be regarded as an optimization of the MetricServer-based implementation: it removes an unnecessary component and the resources that component consumes, and it gives us the raw data to convert on demand.

Appendix

  • Kubelet /stats/summary API response data
{
    "node": {
        "nodeName": "tk01", 
        "systemContainers": [
            {
                "name": "kubelet", 
                "startTime": "2021-01-26T05:10:01Z", 
                "cpu": {
                    "time": "2021-01-26T05:10:22Z", 
                    "usageNanoCores": 108726826, 
                    "usageCoreNanoSeconds": 2009799168
                }, 
                "memory": {
                    "time": "2021-01-26T05:10:22Z", 
                    "usageBytes": 61022208, 
                    "workingSetBytes": 52633600, 
                    "rssBytes": 32174080, 
                    "pageFaults": 45899, 
                    "majorPageFaults": 253
                }
            }
        ], 
        "startTime": "2020-09-12T12:43:58Z", 
        "cpu": {
            "time": "2021-01-26T05:10:27Z", 
            "usageNanoCores": 1078216675, 
            "usageCoreNanoSeconds": 1758203828797878
        }, 
        "memory": {
            "time": "2021-01-26T05:10:27Z", 
            "availableBytes": 2564227072, 
            "usageBytes": 7543758848, 
            "workingSetBytes": 5632106496, 
            "rssBytes": 3926122496, 
            "pageFaults": 2304667, 
            "majorPageFaults": 859
        }, 
        "network": {
            "time": "2021-01-26T05:10:27Z", 
            "name": "eth0", 
            "rxBytes": 166587714846, 
            "rxErrors": 0, 
            "txBytes": 192097080030, 
            "txErrors": 0, 
            "interfaces": [
                {
                    "name": "eth0", 
                    "rxBytes": 166587714846, 
                    "rxErrors": 0, 
                    "txBytes": 192097080030, 
                    "txErrors": 0
                }
            ]
        }, 
        "fs": {
            "time": "2021-01-26T05:10:27Z", 
            "availableBytes": 23964233728, 
            "capacityBytes": 52776349696, 
            "usedBytes": 26562187264, 
            "inodesFree": 2916648, 
            "inodes": 3276800, 
            "inodesUsed": 360152
        }, 
        "runtime": {
            "imageFs": {
                "time": "2021-01-26T05:10:27Z", 
                "availableBytes": 23964233728, 
                "capacityBytes": 52776349696, 
                "usedBytes": 1377648552, 
                "inodesFree": 2916648, 
                "inodes": 3276800, 
                "inodesUsed": 360152
            }
        }, 
        "rlimit": {
            "time": "2021-01-26T05:10:32Z", 
            "maxpid": 32768, 
            "curproc": 905
        }
    }, 
    "pods": [
        {
            "podRef": {
                "name": "etcd-tk01", 
                "namespace": "kube-system", 
                "uid": "2e8885329cb9c936db545fcd71666003"
            }, 
            "startTime": "2021-01-26T05:10:08Z", 
            "containers": [
                {
                    "name": "etcd", 
                    "startTime": "2021-01-26T05:10:09Z", 
                    "cpu": {
                        "time": "2021-01-26T05:10:22Z", 
                        "usageNanoCores": 208950881, 
                        "usageCoreNanoSeconds": 2582358831
                    }, 
                    "memory": {
                        "time": "2021-01-26T05:10:22Z", 
                        "usageBytes": 37478400, 
                        "workingSetBytes": 37081088, 
                        "rssBytes": 36110336, 
                        "pageFaults": 11531, 
                        "majorPageFaults": 10
                    }, 
                    "rootfs": {
                        "time": "2021-01-26T05:10:22Z", 
                        "availableBytes": 23964233728, 
                        "capacityBytes": 52776349696, 
                        "usedBytes": 36864, 
                        "inodesFree": 2916648, 
                        "inodes": 3276800, 
                        "inodesUsed": 8
                    }, 
                    "logs": {
                        "time": "2021-01-26T05:10:22Z", 
                        "availableBytes": 23964233728, 
                        "capacityBytes": 52776349696, 
                        "usedBytes": 28672, 
                        "inodesFree": 2916648, 
                        "inodes": 3276800, 
                        "inodesUsed": 360152
                    }
                }
            ], 
            "cpu": {
                "time": "2021-01-26T05:10:24Z", 
                "usageNanoCores": 161540656, 
                "usageCoreNanoSeconds": 34928899852771
            }, 
            "memory": {
                "time": "2021-01-26T05:10:24Z", 
                "usageBytes": 71917568, 
                "workingSetBytes": 67014656, 
                "rssBytes": 36155392, 
                "pageFaults": 0, 
                "majorPageFaults": 0
            }, 
            "network": {}, 
            "ephemeral-storage": {
                "time": "2021-01-26T05:10:27Z", 
                "availableBytes": 23964233728, 
                "capacityBytes": 52776349696, 
                "usedBytes": 65536, 
                "inodesFree": 2916648, 
                "inodes": 3276800, 
                "inodesUsed": 8
            }, 
            "process_stats": {
                "process_count": 0
            }
        }
    ]
}