ES05 - Cluster Operations and Security

Published: 2025-07-03

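I: Cluster Management, Operations, and Tuning

1: Cluster health monitoring

GET /_cluster/health # returns the status (green/yellow/red), node count, shard counts, number of unassigned shards, etc.
  • green: all primary and replica shards are assigned.
  • yellow: all primary shards are assigned, but some replicas are not (the default state of a single-node cluster).
  • red: at least one primary shard is unassigned (some data is unavailable).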
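
The three statuses follow mechanically from which shard copies are unassigned. A minimal Python sketch of that rule; `cluster_status` and its two counters are hypothetical names used for illustration and are not part of any Elasticsearch client:

```python
def cluster_status(unassigned_primaries: int, unassigned_replicas: int) -> str:
    """Map unassigned shard counts to the status colours described above."""
    if unassigned_primaries > 0:
        return "red"      # at least one primary is missing: some data is unavailable
    if unassigned_replicas > 0:
        return "yellow"   # all primaries are fine, but some replicas are unassigned
    return "green"        # every primary and replica is assigned


# A single-node cluster with 3 primaries and 1 replica each cannot place the replicas.
print(cluster_status(unassigned_primaries=0, unassigned_replicas=3))
```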

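2: Node status and statistics

GET /_nodes/stats          # statistics for all nodes (JVM, indexing, disk, etc.)
GET /_nodes/<node_id>/hot_threads  # find what is consuming CPU on a node

Example: a node's CPU stays above 90%, and hot_threads shows the merge threads fully busy.

Cause: frequent writes of small documents create many small segments and put pressure on segment merging. Fix: raise refresh_interval (for example from 1s to 30s).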
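
To see why a longer interval helps: every refresh that finds newly indexed documents writes a new segment, so the interval bounds how many segments a shard can create through refreshes. A back-of-the-envelope sketch (the helper name is made up for illustration, and it assumes documents arrive continuously during the whole window):

```python
def max_new_segments(window_seconds: int, refresh_interval_seconds: int) -> int:
    """Upper bound on the segments one shard creates through refreshes in a time window."""
    return window_seconds // refresh_interval_seconds


one_hour = 3600
print(max_new_segments(one_hour, 1))    # refresh_interval = 1s
print(max_new_segments(one_hour, 30))   # refresh_interval = 30s
```

Fewer, larger segments leave less work for the merge threads.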

3: Index management

3.1: Creating an index (with explicit shard and replica counts)
PUT /logs-2023
{
    "settings": { 
        "number_of_shards": 3, 
        "number_of_replicas": 1 
    },
    "mappings": { 
        "properties": { 
            "timestamp": { "type": "date" } 
        } 
    }
}
3.2: Index aliases
POST /_aliases
{
  "actions": [
      { 
          "add": { // add action: give the index logs-2023 the alias current-logs
              "index": "logs-2023", 
              "alias": "current-logs"
          } 
      }
  ]
}
3.3: Index templates

Composable index templates (the _index_template API) are available from ES 7.8 onward.

PUT /_index_template/logs_template
{
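    "index_patterns": ["logs-*"], // the template matches indices whose names start with logs-
    // indices matching the pattern are created with the settings and mappings below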
    "template": {
        "settings": { "number_of_shards": 3 },
        "mappings": { ... }
     }
}
3.4: Zero-downtime reindexing

1️⃣ Add the alias current-logs to the old index logs-2023

POST /_aliases
{
    "actions": [
        {
            "add": {
                "index": "logs-2023", 
                "alias": "current-logs"  # applications access the index through this alias
            }
        }
    ]
}

2️⃣ Create the optimized new index logs-2023-optimized

PUT /logs-2023-optimized
{
    "settings": {
        "number_of_shards": 6,      // more shards for higher parallelism
        "number_of_replicas": 1
    },
    "mappings": {
        "properties": {
            "user_id": { "type": "keyword" },  // corrected field type
            "timestamp": { "type": "date" },
            "message": { "type": "text" }
        }
    }
}

3️⃣ Migrate the data asynchronously: POST _reindex?wait_for_completion=false

POST /_reindex?wait_for_completion=false
{
    "source": { 
        "index": "logs-2023" 
    },
    "dest": { 
        "index": "logs-2023-optimized" 
    }
}

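4️⃣ Check the migration status

The _reindex call above returns a task ID. The command below lists the reindex tasks that are still running; GET /_tasks/<task_id> reports "completed": true once the task has finished. Continue with the next step when it is done.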

GET /_tasks?detailed=true&actions=*reindex
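
A polling loop for this step could look like the sketch below. It assumes the task ID returned by the _reindex call is checked with GET /_tasks/<task_id>, whose response carries a completed flag. The HTTP call is injected as a function so the loop can be tried without a cluster; `wait_for_task`, `get_task`, and the task ID are hypothetical names, not part of any client library.

```python
import time
from typing import Callable


def wait_for_task(get_task: Callable[[str], dict], task_id: str,
                  poll_seconds: float = 5.0, max_polls: int = 720) -> dict:
    """Poll until the task reports completed=true, then return the last response.

    get_task is any callable that returns the parsed JSON of GET /_tasks/<task_id>.
    """
    for _ in range(max_polls):
        response = get_task(task_id)
        if response.get("completed"):
            return response
        time.sleep(poll_seconds)
    raise TimeoutError(f"task {task_id} did not complete after {max_polls} polls")


# Stand-in for a real HTTP call: reports completion on the third poll.
_responses = iter([{"completed": False}, {"completed": False}, {"completed": True}])
result = wait_for_task(lambda task_id: next(_responses), "node-1:12345", poll_seconds=0)
print(result)
```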

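5️⃣ Atomically switch the alias to the new index logs-2023-optimized

POST /_aliases
{
    // both actions are applied atomically, so the application notices nothing (the switch takes milliseconds)
    "actions": [
        {
            "remove": {  // detach the alias from the old index
                "index": "logs-2023", 
                "alias": "current-logs"
            }
        },
        {
            "add": {    // attach the alias to the new index
                "index": "logs-2023-optimized", 
                "alias": "current-logs"
            }
        }
    ]
}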

If anything goes wrong after the switch, roll back to the old index immediately

POST /_aliases
{
    "actions": [
        { "remove": { "index": "logs-2023-optimized", "alias": "current-logs" }},
        { "add": { "index": "logs-2023", "alias": "current-logs" }}
    ]
}
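
The cutover and the rollback are the same request with the two indices swapped. A small sketch that builds both bodies; `swap_alias_actions` is a hypothetical helper, and only the request body format comes from the examples above:

```python
import json


def swap_alias_actions(alias: str, from_index: str, to_index: str) -> dict:
    """Build a POST /_aliases body that moves `alias` from one index to another.

    Both actions travel in one request, so the alias never points at zero or two indices.
    """
    return {
        "actions": [
            {"remove": {"index": from_index, "alias": alias}},
            {"add": {"index": to_index, "alias": alias}},
        ]
    }


cutover = swap_alias_actions("current-logs", "logs-2023", "logs-2023-optimized")
rollback = swap_alias_actions("current-logs", "logs-2023-optimized", "logs-2023")
print(json.dumps(cutover, indent=2))
```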

6️⃣ Clean up the old index

DELETE /logs-2023  # run only after confirming that the new index works correctly

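4: Shard management

Note: the number of primary shards cannot be changed in place after an index has been created; changing it means creating a new index (for example by reindexing, as in 3.4).

4.1: Adjusting the replica count
PUT /index/_settings
{ 
    "index.number_of_replicas": 2  // set the replica count to 2
}
4.2: Force-merging segments (use with caution)
POST /index/_forcemerge?max_num_segments=1  # merge down to 1 segment
4.3: Shard allocation control
PUT /_cluster/settings
{
  "transient": { "cluster.routing.allocation.enable": "primaries" }  // allocate primary shards only
}
4.4: Shard management in practice

Disk usage across the cluster is uneven: node A is at 90% while node B is at only 30%. Move a shard manually: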

POST /_cluster/reroute
{
    "commands": [
        {
            "move": {
                "index": "my_index", "shard": 0,
                "from_node": "node_A", "to_node": "node_B"
            }
        }
    ]
}
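
Choosing the source and target node can be automated from disk usage figures. The sketch below picks the fullest and the emptiest node and builds the reroute body; `plan_move` and the usage dictionary are illustrative assumptions, and note that the `move` command needs both `from_node` and `to_node`:

```python
def plan_move(disk_used_percent: dict, index: str, shard: int) -> dict:
    """Build a reroute `move` command from the fullest node to the emptiest one."""
    from_node = max(disk_used_percent, key=disk_used_percent.get)
    to_node = min(disk_used_percent, key=disk_used_percent.get)
    if from_node == to_node:
        raise ValueError("need at least two nodes with different disk usage")
    return {
        "commands": [
            {"move": {"index": index, "shard": shard,
                      "from_node": from_node, "to_node": to_node}}
        ]
    }


body = plan_move({"node_A": 90, "node_B": 30}, index="my_index", shard=0)
print(body)
```

The shard passed in still has to be one that actually lives on the source node; the sketch does not check that.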

5: Snapshot and restore

1️⃣ Register a repository

PUT /_snapshot/my_s3_repo
{
    "type": "s3",
    "settings": { 
        "bucket": "my-es-backups", 
        "region": "us-east-1" 
    }
}

2️⃣ Create a snapshot

PUT /_snapshot/my_s3_repo/snapshot_20231001?wait_for_completion=true

3️⃣ Restore from a specific snapshot

POST /_snapshot/my_s3_repo/snapshot_20231001/_restore
{
    "indices": "orders",
    "rename_pattern": "(.+)",
    "rename_replacement": "restored_$1"
}
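
The rename_pattern / rename_replacement pair is a regular-expression substitution on the index name. The sketch below approximates it with Python's re module; `restored_name` is a hypothetical helper, and since Elasticsearch uses Java regular expressions, edge cases may differ:

```python
import re


def restored_name(index_name: str, rename_pattern: str, rename_replacement: str) -> str:
    """Approximate how a restore renames an index.

    The restore request writes capture groups as $1, $2, ...; Python's re module
    expects a backslash form, so the replacement is translated first.
    """
    python_replacement = re.sub(r"\$(\d+)", r"\\\1", rename_replacement)
    return re.sub(rename_pattern, python_replacement, index_name)


print(restored_name("orders", "(.+)", "restored_$1"))
```

With the values from the request above, the restored index is created next to the original instead of overwriting it.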

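6: Rolling restart & upgrade

6.1: Rolling restart

Restart the nodes one at a time to carry out maintenance (configuration updates, memory-leak fixes, hardware maintenance, etc.) while keeping the cluster highly available.

1️⃣ Disable shard allocation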

PUT _cluster/settings
{
    "persistent": {
        "cluster.routing.allocation.enable": "none"
    }
}

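2️⃣ Stop index writes (optional)

# Caution: closing an index blocks reads as well as writes; pausing the indexing clients is usually enough.
# If indices are closed here, reopen them afterwards with POST _all/_open.
POST _all/_close

3️⃣ Flush (make sure the data is persisted to disk)

# Synced flush (POST _flush/synced) was deprecated in 7.6 and removed in 8.0; a regular flush is used instead
POST _flush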

4️⃣ Restart the nodes one at a time

# Stop the service
sudo systemctl stop elasticsearch

# Perform the maintenance (e.g. update the configuration)
sudo nano /etc/elasticsearch/elasticsearch.yml

# Start the service
sudo systemctl start elasticsearch

5️⃣ Wait for the node to recover

# Monitor node status
GET _cat/nodes?v&h=name,ip,version,ram.percent,node.role,load_1m

# Check shard status
GET _cat/shards?v&h=index,shard,prirep,state,node&s=node

6️⃣ Re-enable shard allocation

PUT _cluster/settings
{
    "persistent": {
        "cluster.routing.allocation.enable": null
    }
}

7️⃣ Wait for the cluster to turn green

watch -n 5 'curl -sXGET "localhost:9200/_cluster/health?pretty"'
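6.2: Rolling upgrade

Version compatibility matrix

Current version    Upgrade target    Upgrade path
7.x                8.x               7.17 → 8.0 → 8.x
6.8+               7.x               6.8 → 7.17 → 8.x
5.x                6.x               Full cluster restart required
< 5.x              Newer versions    Reindex required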
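
[Sequence diagram: for node 1 the console stops the service, upgrades the software, starts the service, and waits for the node to rejoin the cluster while monitoring cluster status in a health-check loop; the same flow is then repeated for node 2 and node 3.]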

1️⃣ Health check

GET _cluster/health?pretty
GET _cat/indices?health=red

2️⃣ Run the deprecation info API; every critical-level issue must be resolved before upgrading

GET /_migration/deprecations?pretty

3️⃣ Back up the data

# Register a snapshot repository
PUT _snapshot/upgrade_backup
{
  "type": "fs",
  "settings": {"location": "/mnt/backups/upgrade_2023"}
}

# Take a full snapshot
PUT _snapshot/upgrade_backup/snapshot_pre_upgrade?wait_for_completion=true

4️⃣ Check plugin compatibility

bin/elasticsearch-plugin list
# Download the plugin packages for the new version in advance

5️⃣ Disable shard allocation

PUT _cluster/settings
{
    "persistent": {
        "cluster.routing.allocation.enable": "none"
    }
}

6️⃣ Close non-essential indices

POST .ml*,.apm*,.transform*/_close

7️⃣ Upgrade the nodes one at a time

# 1. Stop the service
sudo systemctl stop elasticsearch

# 2. Upgrade the package (RPM as an example)
sudo rpm -Uvh elasticsearch-8.10.0-x86_64.rpm

# 3. Resolve configuration conflicts
sudo diff /etc/elasticsearch/elasticsearch.yml.rpmnew \
         /etc/elasticsearch/elasticsearch.yml

# 4. Start the service
sudo systemctl start elasticsearch

8️⃣ Post-upgrade verification

# Check node versions
GET _nodes?filter_path=nodes.*.version

# Test core functionality
POST test_upgrade/_doc
{"message":"upgrade test"}
GET test_upgrade/_search

9️⃣ Re-enable features

# Re-enable shard allocation
PUT _cluster/settings
{"persistent": {"cluster.routing.allocation.enable": null}}

# Reopen the indices closed in step 6
POST .ml*,.apm*,.transform*/_open
6.3: Special cases in version upgrades

Major-version upgrade (7.x -> 8.x)

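# Make sure indices may be closed: cluster.indices.close.enable defaults to true and only needs setting if it was disabled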
PUT _cluster/settings
{
  "persistent": {
    "cluster.indices.close.enable": true
  }
}

Security configuration upgrade

# Security is enabled by default in 8.x; reset the elastic user's password
bin/elasticsearch-reset-password -u elastic

Reloading secure settings

POST _nodes/reload_secure_settings
6.4: Disaster recovery
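Recovery flow: upgrade problem detected → stop all nodes → uninstall the new version → install the old version → restore the configuration files → start the cluster → restore from snapshot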

1️⃣ Stop the service

sudo systemctl stop elasticsearch

2️⃣ Install the old version (downgrade)

sudo rpm -Uvh --oldpackage elasticsearch-7.17.10-x86_64.rpm

3️⃣ Restore the configuration

cp /etc/elasticsearch/elasticsearch.yml.bak \
   /etc/elasticsearch/elasticsearch.yml

4️⃣ Start the cluster

sudo systemctl start elasticsearch

5️⃣ Restore the data from the pre-upgrade snapshot

POST _snapshot/upgrade_backup/snapshot_pre_upgrade/_restore

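7: Performance tuning

7.1: Hardware and basic configuration
Item               Recommendation
JVM heap           ≤ 31 GB (keeps compressed OOPs enabled)
Disk               SSD (avoid NAS / network storage)
File descriptors   ulimit -n 65536
Virtual memory     sysctl -w vm.max_map_count=262144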
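
The heap limit is usually combined with a second rule of thumb: give the heap at most half of the machine's RAM so the rest stays available to the filesystem cache. That 50% rule is common Elasticsearch sizing guidance and an assumption here, not something stated in the table. A minimal sketch with a made-up helper name:

```python
def recommended_heap_gb(total_ram_gb: float, compressed_oops_limit_gb: float = 31.0) -> float:
    """Half of RAM, capped at the 31 GB limit from the table above."""
    return min(total_ram_gb / 2, compressed_oops_limit_gb)


for ram in (16, 64, 128):
    print(f"{ram} GB RAM -> {recommended_heap_gb(ram):g} GB heap")
```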
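7.2: Index design optimization

Calculating the shard count

Target shard size: 30–50 GB.

Example: 100 GB of data per day, retained for 30 days → total shards = (100 GB × 30) / 40 GB = 75

Spread evenly across 5 nodes, that is 15 shards per node (primary shards only; replicas add to this).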
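
The same arithmetic as a reusable sketch. `shard_plan` is a hypothetical helper, and it rounds up so a partially filled last shard still counts:

```python
import math
from typing import Tuple


def shard_plan(daily_gb: float, retention_days: int,
               target_shard_gb: float, nodes: int) -> Tuple[int, int]:
    """Return (total primary shards, primary shards per node), rounding up."""
    total_gb = daily_gb * retention_days
    total_shards = math.ceil(total_gb / target_shard_gb)
    per_node = math.ceil(total_shards / nodes)
    return total_shards, per_node


print(shard_plan(daily_gb=100, retention_days=30, target_shard_gb=40, nodes=5))
```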

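Avoiding field explosion

Use the flattened type for dynamic fields (e.g. JSON logs)

7.3: Query optimization

Put conditions that do not need scoring in filter context rather than query context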

"query": {
    "bool": {
        "filter": [{"range": {"timestamp": {"gte": "now-1d"}}}], // filter context: no score is computed
        "must": [{"match": {"product": "phone"}}] // query context: a relevance score is computed
    }
}

An alternative to deep pagination (search_after)

GET /index/_search
{
    "size": 100,
    "sort": [{"timestamp": "desc"}, {"_id": "asc"}],
    "search_after": ["2023-10-01T00:00:00", "abcd1234"]
}
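
With search_after, each page's cursor is the sort array of the last hit on the previous page. A sketch of building successive request bodies; `next_page_body` is a hypothetical helper, and the sample hit simply reuses the values from the request above:

```python
from typing import Optional


def next_page_body(previous_hits: Optional[list], size: int = 100) -> dict:
    """Build the search body for the next page from the previous page's hits."""
    body = {
        "size": size,
        "sort": [{"timestamp": "desc"}, {"_id": "asc"}],
    }
    if previous_hits:
        # Each hit carries a "sort" array; the last one is the cursor for the next page.
        body["search_after"] = previous_hits[-1]["sort"]
    return body


first = next_page_body(None)
hits = [{"_id": "abcd1234", "sort": ["2023-10-01T00:00:00", "abcd1234"]}]
second = next_page_body(hits)
print(second["search_after"])
```

The first request carries no cursor; paging stops when a response comes back with no hits.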
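7.4: Write optimization
Parameter                          Suggested value   Effect
index.refresh_interval             30s               Fewer Lucene segments are created
index.translog.durability          async             Translog is written asynchronously (risk: up to 5s of data can be lost)
indices.memory.index_buffer_size   20%               Larger share of memory for the indexing buffer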

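II: Security

1: Authentication and authorization

1.1: Basic security

Edit elasticsearch.yml

xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
1.2: Initializing the built-in users
bin/elasticsearch-setup-passwords auto  # generate the passwords automatically (deprecated in 8.0 in favor of elasticsearch-reset-password)
1.3: RBAC access control

RBAC golden rules

  • Least privilege (users get only the permissions they need)
  • Role layering (a base role plus department-specific extension roles)
  • Regular permission audits (review permission assignments monthly)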

[Diagram: User → Role → privileges]
  • Cluster privileges: monitor (monitoring), manage_index_templates (template management), manage_security (security management)
  • Index privileges: read, index (write), delete, manage
  • Application privileges: Kibana space access
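
Privilege type            Example                                        Description
Cluster privileges        monitor, manage, all                           Cluster-level operations (e.g. viewing node status, changing settings)
Index privileges          read, write, delete_index                      Operations on specific indices
Field-level security      grant: ['*'], except: ['salary']               Access to individual fields of an index
Document-level security   { "term": { "department": "engineering" } }    Access control based on document content
Application privileges    kibana_admin, logs_viewer                      Access to applications such as Kibana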

1️⃣ Create roles (restricting access to specific indices)

POST /_security/role/logs_reader
{
    "indices": [
        {
            "names": ["logs-*"],
            "privileges": ["read", "view_index_metadata"]
        }
    ]
}


// A role with field-level and document-level restrictions
POST /_security/role/finance_auditor
{
  "cluster": ["monitor", "manage_ilm"],
  "indices": [
    {
      "names": ["transactions-*"],
      "privileges": ["read", "view_index_metadata"],
      "field_security": {
        "grant": ["*"],
        "except": ["credit_card_number"]
      },
      "query": {
        "term": { "department": "finance" }
      }
    }
  ],
  "applications": [
    {
      "application": "kibana-.kibana",
      "privileges": ["feature_discover.read", "feature_dashboard.read"],
      "resources": ["space:finance"]
    }
  ]
}
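
To make the effect of the finance_auditor role concrete, the sketch below models it on plain dictionaries: the term query decides which documents are visible, and field_security decides which fields of those documents are returned. This is a simplified illustration with made-up sample documents and helper names, not how Elasticsearch implements the checks.

```python
from fnmatch import fnmatchcase


def visible_fields(doc: dict, grant: list, except_: list) -> dict:
    """Keep fields matching a grant pattern unless they also match an except pattern."""
    return {
        name: value
        for name, value in doc.items()
        if any(fnmatchcase(name, p) for p in grant)
        and not any(fnmatchcase(name, p) for p in except_)
    }


def read_as_finance_auditor(docs: list) -> list:
    """Apply the role's term query (document level), then its field_security (field level)."""
    allowed = [d for d in docs if d.get("department") == "finance"]
    return [visible_fields(d, grant=["*"], except_=["credit_card_number"]) for d in allowed]


docs = [
    {"department": "finance", "amount": 120, "credit_card_number": "4111-0000"},
    {"department": "sales", "amount": 75, "credit_card_number": "4222-0000"},
]
print(read_as_finance_auditor(docs))
```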

2️⃣ Create a user and assign the role:

POST /_security/user/john
{
    "password": "password123",
    "roles": ["logs_reader"]
}

2: Transport encryption (TLS)

[Diagram: the client talks to Elasticsearch over HTTPS (port 9200); Node1, Node2, and Node3 talk to each other over TLS-encrypted transport connections (port 9300).]
2.1: Basic usage

1️⃣ Generate certificates

# Create the CA
bin/elasticsearch-certutil ca \
  --out config/certs/elastic-stack-ca.p12 \
  --pass "CaSecret@2023"

# Generate the node certificate (covering the DNS names and IPs of all nodes)
bin/elasticsearch-certutil cert \
  --ca config/certs/elastic-stack-ca.p12 \
  --ca-pass "CaSecret@2023" \
  --out config/certs/elastic-nodes.p12 \
  --pass "NodeCertPass!2023" \
  --dns node1.escluster,node2.escluster,node3.escluster \
  --ip 192.168.1.101,192.168.1.102,192.168.1.103

2️⃣ Configure elasticsearch.yml

# ======== Node-to-node transport encryption ========
xpack.security.transport.ssl:
  enabled: true
  verification_mode: certificate
  keystore.path: certs/elastic-nodes.p12
  keystore.password: "NodeCertPass!2023"
  truststore.path: certs/elastic-nodes.p12
  truststore.password: "NodeCertPass!2023"

# ======== HTTP API encryption ========
xpack.security.http.ssl:
  enabled: true
  keystore.path: certs/elastic-nodes.p12
  keystore.password: "NodeCertPass!2023"
  verification_mode: certificate

# ======== Advanced security settings ========
# Enable client certificate authentication
xpack.security.http.ssl.client_authentication: optional
# Enable the token and API key services
xpack.security.authc.token.enabled: true
xpack.security.authc.api_key.enabled: true
2.2: Secure connection configuration for Kibana

Configure the following in kibana.yml

# Encrypted connection to Elasticsearch
elasticsearch.hosts: ["https://node1.escluster:9200"]
elasticsearch.ssl:
  certificateAuthorities: /path/to/elastic-stack-ca.crt
  verificationMode: certificate

# HTTPS for the Kibana server
server.ssl:
  enabled: true
  certificate: /path/to/kibana.crt
  key: /path/to/kibana.key

# Credentials
elasticsearch.username: "kibana_system"
elasticsearch.password: "StrongKibanaPass!2023"
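2.3: Client certificate authentication
# Generate the client certificate
bin/elasticsearch-certutil cert \
  --ca config/certs/elastic-stack-ca.p12 \
  --ca-pass "CaSecret@2023" \
  --name "api-client" \
  --out config/certs/api-client.p12 \
  --pass "ClientCertPass"

# Access the cluster with the certificate (curl needs the PKCS#12 type stated explicitly)
curl --cert config/certs/api-client.p12:ClientCertPass \
  --cert-type P12 \
  https://escluster:9200/_cluster/health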
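2.4: Automatic certificate rotation
# Crontab entry: rotate the certificates at 00:00 on the 1st of every month
0 0 1 * * /opt/elasticsearch/scripts/rotate_certs.sh

#!/bin/bash
# rotate_certs.sh
NEW_PASS=$(openssl rand -base64 16)

# Generate the new certificate
bin/elasticsearch-certutil cert ... --pass "$NEW_PASS"

# Reload secure settings on the nodes (host and credentials are placeholders;
# the request body is only needed for a password-protected keystore)
curl -s -u "elastic:${ELASTIC_PASSWORD}" -X POST \
  -H 'Content-Type: application/json' \
  "https://localhost:9200/_nodes/reload_secure_settings" \
  -d '{"secure_settings_password": "OldPass@2023"}'

# Update the configuration file ("|" as the sed delimiter, because base64 output may contain "/")
sed -i "s|NodeCertPass!2023|$NEW_PASS|g" elasticsearch.yml

# Restart the service (as a rolling restart)
systemctl restart elasticsearch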
2.5: Security audit configuration
# elasticsearch.yml
xpack.security.audit.enabled: true
xpack.security.audit.logfile.events.include: access_denied,anonymous_access_denied,authentication_failed,tampered_request,connection_denied
xpack.security.audit.logfile.events.exclude: authentication_success