ES05 - Cluster Operations and Security
Part 1: Cluster Management, Operations, and Tuning
1: Cluster Health Monitoring
GET /_cluster/health # returns the status (green/yellow/red), node count, shard count, number of unassigned shards, etc.
- green: all primary and replica shards are allocated.
- yellow: all primary shards are allocated, but some replicas are not (the default state of a single-node cluster).
- red: at least one primary shard is unassigned (its data is unavailable).
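For scripted health checks, the same endpoint can block until a target status is reached; a minimal sketch (the 30s timeout is an arbitrary example):
GET /_cluster/health?wait_for_status=green&timeout=30s # returns as soon as the cluster is green, or after 30s with "timed_out": true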
2: Node Status and Statistics
GET /_nodes/stats # statistics for all nodes (JVM, indexing, disk, etc.)
GET /_nodes/node_id/hot_threads # locate CPU hot spots on a node
Example: a node's CPU stays above 90% and hot_threads shows the merge threads fully occupied. Cause: frequent writes of small documents create heavy segment-merge pressure. Fix: raise refresh_interval (e.g. from 1s to 30s), as in the sketch below.
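A minimal sketch of the refresh_interval change, using the logs-2023 index from the later examples as a stand-in:
PUT /logs-2023/_settings
{
"index": { "refresh_interval": "30s" } // dynamic setting; takes effect immediately
}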
3: Index Management
3.1: Creating an Index (specifying shards and replicas)
PUT /logs-2023
{
"settings": {
"number_of_shards": 3,
"number_of_replicas": 1
},
"mappings": {
"properties": {
"timestamp": { "type": "date" }
}
}
}
3.2: Index Aliases
POST /_aliases
{
"actions": [
{
"add": { // add action: give the index logs-2023 an alias named current-logs
"index": "logs-2023",
"alias": "current-logs"
}
}
]
}
3.3: Index Templates
Composable index templates (the _index_template API) are available from ES 7.8 onward.
PUT /_index_template/logs_template
{
"index_patterns": ["logs-*"], // the template matches indices whose names start with logs-
// the following settings and mappings are applied to matching indices
"template": {
"settings": { "number_of_shards": 3 },
"mappings": { ... }
}
}
3.4: Zero-Downtime Reindexing
1️⃣ Add the alias current-logs to the old index logs-2023
POST /_aliases
{
"actions": [
{
"add": {
"index": "logs-2023",
"alias": "current-logs" # applications access the index through this alias
}
}
]
}
2️⃣ Create the new, optimized index logs-2023-optimized
PUT /logs-2023-optimized
{
"settings": {
"number_of_shards": 6, // more shards for higher concurrency
"number_of_replicas": 1
},
"mappings": {
"properties": {
"user_id": { "type": "keyword" }, // corrected field type
"timestamp": { "type": "date" },
"message": { "type": "text" }
}
}
}
3️⃣ Start the data migration asynchronously with POST _reindex?wait_for_completion=false:
POST /_reindex?wait_for_completion=false
{
"source": {
"index": "logs-2023"
},
"dest": {
"index": "logs-2023-optimized"
}
}
4️⃣ Check migration status
Run the following command and move on to the next step once the reindex task reports completed=true:
GET /_tasks?detailed=true&actions=*reindex
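The per-task endpoint reports completion directly; a sketch where <task_id> stands for the "task" value returned by the _reindex call above:
GET /_tasks/<task_id> # the response contains "completed": true once the reindex has finished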
5️⃣ Atomically switch the alias to the new index logs-2023-optimized (the old index is deleted in step 6)
POST /_aliases
{
// this operation is atomic; applications notice nothing (the switch takes milliseconds)
"actions": [
{
"remove": { // detach the old index
"index": "logs-2023",
"alias": "current-logs"
}
},
{
"add": { // attach the new index
"index": "logs-2023-optimized",
"alias": "current-logs"
}
}
]
}
If anything goes wrong, roll back to the old index immediately:
POST /_aliases
{
"actions": [
{ "remove": { "index": "logs-2023-optimized", "alias": "current-logs" }},
{ "add": { "index": "logs-2023", "alias": "current-logs" }}
]
}
6️⃣ Clean up the old index
DELETE /logs-2023 # run only after confirming the new index works correctly
4: Shard Management
!!! The number of primary shards cannot be changed after an index is created.
4.1: Adjusting the Replica Count
PUT /index/_settings
{
"index.number_of_replicas": 2 // set the replica count to 2
}
4.2: Force-Merging Segments (use with caution)
POST /index/_forcemerge?max_num_segments=1 # merge down to a single segment
4.3: Shard Allocation Control
PUT /_cluster/settings
{
"transient": { "cluster.routing.allocation.enable": "primaries" } // allocate primary shards only
}
4.4: Shard Management in Practice
Disk usage is uneven across the cluster: node A is at 90% while node B is at only 30%. Move a shard manually:
POST /_cluster/reroute
{
"commands": [
{
"move": {
"index": "my_index", "shard": 0,
"from_node": "node_A", "to_node": "node_B"
}
}
]
}
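To confirm the imbalance before and after the move, _cat/allocation shows per-node shard counts and disk usage (the column list here is just one choice):
GET /_cat/allocation?v&h=node,shards,disk.percent,disk.used,disk.avail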
5: Snapshots and Restore
1️⃣ Register a repository
PUT /_snapshot/my_s3_repo
{
"type": "s3",
"settings": {
"bucket": "my-es-backups",
"region": "us-east-1"
}
}
2️⃣ Create a snapshot
PUT /_snapshot/my_s3_repo/snapshot_20231001?wait_for_completion=true
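Even with wait_for_completion=true, it is worth confirming the snapshot's final state (a minimal check using the names registered above):
GET /_snapshot/my_s3_repo/snapshot_20231001 # "state": "SUCCESS" means the snapshot completed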
3️⃣ Restore from a specific snapshot
POST /_snapshot/my_s3_repo/snapshot_20231001/_restore
{
"indices": "orders",
"rename_pattern": "(.+)",
"rename_replacement": "restored_$1"
}
6: Rolling Restarts & Upgrades
6.1: Rolling Restart
Restart nodes one at a time while keeping the cluster available, in order to perform maintenance (configuration updates, fixing memory leaks, hardware maintenance, etc.).
1️⃣ Disable shard allocation
PUT _cluster/settings
{
"persistent": {
"cluster.routing.allocation.enable": "none"
}
}
2️⃣ Stop index writes (note: closed indices are unavailable until reopened)
POST _all/_close
3️⃣ Perform a synced flush to make sure in-memory data is persisted (synced flush was removed in 8.0; on 8.x use a plain POST _flush instead)
POST _flush/synced
4️⃣ Restart nodes one at a time
# stop the service
sudo systemctl stop elasticsearch
# perform the maintenance (e.g. edit the configuration)
sudo nano /etc/elasticsearch/elasticsearch.yml
# start the service
sudo systemctl start elasticsearch
5️⃣ Wait for the node to recover
# monitor node status
GET _cat/nodes?v&h=name,ip,version,ram.percent,node.role,load_1m
# check shard states
GET _cat/shards?v&h=index,shard,prirep,state,node&s=node
6️⃣ Re-enable shard allocation
PUT _cluster/settings
{
"persistent": {
"cluster.routing.allocation.enable": null
}
}
7️⃣ Wait for the cluster to turn green
watch -n 5 'curl -sXGET "localhost:9200/_cluster/health?pretty"'
6.2: Rolling Upgrade
Version compatibility matrix
Current version | Upgradable to | Upgrade path |
---|---|---|
7.x | 8.x | 7.17 → 8.0 → 8.x |
6.8+ | 7.x | 6.8 → 7.17 → 8.x |
5.x | 6.x | full cluster restart required |
< 5.x | newer versions | reindex required |
1️⃣ Health check
GET _cluster/health?pretty
GET _cat/indices?health=red
2️⃣ Deprecation API check -> every critical-level issue must be resolved
GET /_migration/deprecations?pretty
3️⃣ Back up the data
# create a snapshot repository
PUT _snapshot/upgrade_backup
{
"type": "fs",
"settings": {"location": "/mnt/backups/upgrade_2023"}
}
# take a full snapshot
PUT _snapshot/upgrade_backup/snapshot_pre_upgrade?wait_for_completion=true
4️⃣ Check plugin compatibility
bin/elasticsearch-plugin list
# download the new plugin versions ahead of time
5️⃣ Disable shard allocation
PUT _cluster/settings
{
"persistent": {
"cluster.routing.allocation.enable": "none"
}
}
6️⃣ Close non-essential indices
POST .ml*,.apm*,.transform*/_close
7️⃣ Upgrade nodes one by one
# 1. stop the service
sudo systemctl stop elasticsearch
# 2. upgrade the package (RPM shown here)
sudo rpm -Uvh elasticsearch-8.10.0-x86_64.rpm
# 3. resolve configuration conflicts
sudo diff /etc/elasticsearch/elasticsearch.yml.rpmnew \
/etc/elasticsearch/elasticsearch.yml
# 4. start the service
sudo systemctl start elasticsearch
8️⃣ Post-upgrade verification
# check node versions
GET _nodes?filter_path=nodes.*.version
# smoke-test core functionality
POST test_upgrade/_doc
{"message":"upgrade test"}
GET test_upgrade/_search
9️⃣ Re-enable features
# re-enable shard allocation
PUT _cluster/settings
{"persistent": {"cluster.routing.allocation.enable": null}}
# reopen the system indices
POST .ml*,.apm*,.transform*/_open
6.3: Special Cases During Version Upgrades
Upgrading across major versions (7.x -> 8.x)
# required step: set the compatibility flag
PUT _cluster/settings
{
"persistent": {
"cluster.indices.close.enable": true
}
}
Security configuration after the upgrade
# security is enabled by default in 8.x
bin/elasticsearch-reset-password -u elastic
Reload secure settings
POST _nodes/reload_secure_settings
6.4: Disaster Recovery (rollback)
1️⃣ Stop the service
sudo systemctl stop elasticsearch
2️⃣ Downgrade the package
sudo rpm -Uvh --oldpackage elasticsearch-7.17.10-x86_64.rpm
3️⃣ Restore the configuration
cp /etc/elasticsearch/elasticsearch.yml.bak \
/etc/elasticsearch/elasticsearch.yml
4️⃣ Start the cluster
sudo systemctl start elasticsearch
5️⃣ Restore the data
POST _snapshot/upgrade_backup/snapshot_pre_upgrade/_restore
7: Performance Tuning
7.1: Hardware and Base Configuration
Item | Recommendation |
---|---|
JVM heap | ≤ 31GB (so compressed OOPs stay enabled) |
Disk | SSD (avoid NAS / network storage) |
File descriptors | ulimit -n 65536 |
Virtual memory | sysctl -w vm.max_map_count=262144 |
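A sketch of where these settings live on a default RPM/DEB install (the file names under jvm.options.d and sysctl.d are arbitrary; the heap size is an example value):
# /etc/elasticsearch/jvm.options.d/heap.options: set Xms and Xmx to the same value, at most 31 GB
-Xms31g
-Xmx31g
# /etc/sysctl.d/99-elasticsearch.conf: makes vm.max_map_count persistent across reboots
vm.max_map_count=262144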
7.2: Index Design Optimization
Calculating the shard count
Target shard size: 30–50GB.
Example: 100GB of data per day, retained for 30 days → total shards = (100GB * 30) / 40GB ≈ 75.
Spread across 5 nodes, that is 15 shards per node.
Avoid field explosion
Use the flattened field type for dynamic fields (such as free-form JSON log payloads), as sketched below.
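A minimal mapping sketch; the labels field name is hypothetical:
PUT /logs-2023/_mapping
{
"properties": {
"labels": { "type": "flattened" } // arbitrary nested keys are stored without creating new mapped fields
}
}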
7.3: Query Optimization
Prefer filter context over query context when relevance scoring is not needed:
GET /index/_search
{
"query": {
"bool": {
"filter": [{"range": {"timestamp": {"gte": "now-1d"}}}], // no relevance score computed; results can be cached
"must": [{"match": {"product": "phone"}}] // scored
}
}
}
Alternative to deep pagination: search_after, which takes the sort values of the last hit from the previous page
GET /index/_search
{
"size": 100,
"sort": [{"timestamp": "desc"}, {"_id": "asc"}],
"search_after": ["2023-10-01T00:00:00", "abcd1234"]
}
7.4: Write Optimization
Parameter | Recommended value | Effect |
---|---|---|
refresh_interval | 30s | fewer Lucene segments are generated |
translog.durability | async | translog written asynchronously (risk: up to ~5s of data loss) |
indices.memory.index_buffer_size | 20% | larger share of memory for the indexing buffer |
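A sketch of applying these values (the two index-level settings are dynamic; index_buffer_size is a node-level setting and belongs in elasticsearch.yml):
PUT /logs-2023/_settings
{
"index": {
"refresh_interval": "30s",
"translog.durability": "async",
"translog.sync_interval": "5s" // default; with async durability, up to ~5s of writes are at risk
}
}
# node-level setting, goes in elasticsearch.yml:
indices.memory.index_buffer_size: 20%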
Part 2: Security
1: Authentication and Authorization
1.1: Basic Security
Edit elasticsearch.yml:
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
1.2: Initializing Built-in Users
bin/elasticsearch-setup-passwords auto # generate passwords automatically
1.3: RBAC Access Control
RBAC golden rules
- Principle of least privilege (users get only the permissions they need)
- Role inheritance (base roles plus department-specific extension roles)
- Regular permission audits (review role assignments monthly)
Permission category | Examples | Description |
---|---|---|
Cluster privileges | monitor, manage, all | cluster-level operations (view node status, change settings) |
Index privileges | read, write, delete_index | operations on specific indices |
Field-level security | grant: ['*'], except: ['salary'] | access to individual fields of an index |
Document-level security | { "term": { "department": "engineering" } } | access control based on document content |
Application privileges | kibana_admin, logs_viewer | access to applications such as Kibana |
1️⃣ Create a role (restricted to specific indices)
POST /_security/role/logs_reader
{
"indices": [
{
"names": ["logs-*"],
"privileges": ["read", "view_index_metadata"]
}
]
}
// a role that also applies field-level and document-level restrictions
POST /_security/role/finance_auditor
{
"cluster": ["monitor", "manage_ilm"],
"indices": [
{
"names": ["transactions-*"],
"privileges": ["read", "view_index_metadata"],
"field_security": {
"grant": ["*"],
"except": ["credit_card_number"]
},
"query": {
"term": { "department": "finance" }
}
}
],
"applications": [
{
"application": "kibana-.kibana",
"privileges": ["feature_discover.read", "feature_dashboard.read"],
"resources": ["space:finance"]
}
]
}
2️⃣ Create a user and assign the role:
POST /_security/user/john
{
"password": "password123",
"roles": ["logs_reader"]
}
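To verify the new account, the _security/_authenticate endpoint returns the effective user and roles (a sketch; the host and the -k flag depend on your TLS setup):
curl -u john:password123 -k "https://localhost:9200/_security/_authenticate?pretty"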
2: Transport Encryption (TLS)
2.1: Basic Setup
1️⃣ Generate certificates
# create the CA
bin/elasticsearch-certutil ca \
--out config/certs/elastic-stack-ca.p12 \
--pass "CaSecret@2023"
# generate node certificates (covering every node's DNS names and IPs)
bin/elasticsearch-certutil cert \
--ca config/certs/elastic-stack-ca.p12 \
--ca-pass "CaSecret@2023" \
--out config/certs/elastic-nodes.p12 \
--pass "NodeCertPass!2023" \
--dns node1.escluster,node2.escluster,node3.escluster \
--ip 192.168.1.101,192.168.1.102,192.168.1.103
2️⃣ Configure elasticsearch.yml
# ======== Inter-node transport encryption ========
xpack.security.transport.ssl:
  enabled: true
  verification_mode: certificate
  keystore.path: certs/elastic-nodes.p12
  keystore.password: "NodeCertPass!2023"
  truststore.path: certs/elastic-nodes.p12
  truststore.password: "NodeCertPass!2023"
# ======== HTTP API encryption ========
xpack.security.http.ssl:
  enabled: true
  keystore.path: certs/elastic-nodes.p12
  keystore.password: "NodeCertPass!2023"
  verification_mode: certificate
# ======== Additional security settings ========
# enable client certificate authentication
xpack.security.http.ssl.client_authentication: optional
# enable token and API key authentication
xpack.security.authc.token.enabled: true
xpack.security.authc.api_key.enabled: true
2.2: Securing the Kibana Connection
Configure in kibana.yml:
# encrypted connection to Elasticsearch
elasticsearch.hosts: ["https://node1.escluster:9200"]
elasticsearch.ssl:
  certificateAuthorities: /path/to/elastic-stack-ca.crt
  verificationMode: certificate
# HTTPS for the Kibana server itself
server.ssl:
  enabled: true
  certificate: /path/to/kibana.crt
  key: /path/to/kibana.key
# credentials
elasticsearch.username: "kibana_system"
elasticsearch.password: "StrongKibanaPass!2023"
2.3: Client Certificate Authentication
# generate a client certificate
bin/elasticsearch-certutil cert \
--ca config/certs/elastic-stack-ca.p12 \
--ca-pass "CaSecret@2023" \
--name "api-client" \
--out config/certs/api-client.p12
# access the cluster using the certificate
curl -E config/certs/api-client.p12 --cert-type P12 \
  --pass "ClientCertPass" \
  https://escluster:9200/_cluster/health
2.4: Automated Certificate Rotation
# crontab entry: rotate certificates on the 1st of every month
0 0 1 * * /opt/elasticsearch/scripts/rotate_certs.sh
#!/bin/bash
# rotate_certs.sh
NEW_PASS=$(openssl rand -base64 16)
# generate a new certificate
bin/elasticsearch-certutil cert ... --pass $NEW_PASS
# reload secure settings on the nodes (the API verb is POST)
curl -X POST "localhost:9200/_nodes/reload_secure_settings" \
  -H 'Content-Type: application/json' \
  -d '{"secure_settings_password": "OldPass@2023"}'
# update the configuration file
sed -i "s/NodeCertPass!2023/$NEW_PASS/g" elasticsearch.yml
# restart the service (node by node, as a rolling restart)
systemctl restart elasticsearch
2.5: Security Audit Configuration
# elasticsearch.yml
xpack.security.audit.enabled: true
xpack.security.audit.logfile.events.include: access_denied,anonymous_access_denied,authentication_failed,tampered_request,connection_denied
xpack.security.audit.logfile.events.exclude: authentication_success
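Audit events are written as JSON to <cluster_name>_audit.json in the log directory; a quick way to watch them (path assumes a default RPM/DEB install, and <cluster_name> is a placeholder):
tail -f /var/log/elasticsearch/<cluster_name>_audit.json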