Raft Implementation in Nacos 1.x
As a distributed registry, Nacos supports both AP and CP modes in its consistency architecture; the CP mode uses Raft as its strong-consistency algorithm. This article analyzes the source code of some of these features based on Nacos 1.x.
For an introduction to the Raft paper, see: https://github.com/maemual/raft-zh_cn/blob/master/raft-zh_cn.md
RaftCore
The initialization method performs several main tasks:
- Load metadata from persisted files into memory
- Register the Leader election task
- Register the Leader heartbeat task
Metadata is the core data structure here. It is stored as key-value pairs: the key is generally composed of a data-type qualifier plus a service identifier, and the value is the actual data. Taking service discovery as an example (see the Datum sketch after the samples):
// Service instance list (Instances)
com.alibaba.nacos.naming.iplist.public##MyService
{
    "instanceList": [
        {
            "ip": "192.168.1.100",
            "port": 8080,
            "weight": 1.0,
            "healthy": true
        },
        {
            "ip": "192.168.1.101",
            "port": 8081,
            "weight": 1.0,
            "healthy": true
        }
    ]
}
// Service metadata (Service)
com.alibaba.nacos.naming.domains.meta.public##MyService
{
    "name": "MyService",
    "protectThreshold": 0.5,
    "metadata": {
        "version": "1.0.0",
        "owner": "teamA"
    }
}
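In memory these key-value entries are held as Datum objects. The following is only a simplified sketch of the Datum structure based on the 1.x naming module (Record here is Nacos's own Record interface, not java.lang.Record):
import java.io.Serializable;
import java.util.concurrent.atomic.AtomicLong;
// Simplified sketch of Datum; the real class lives in the Nacos naming module
public class Datum<T extends Record> implements Serializable {
    // Full key, e.g. "com.alibaba.nacos.naming.iplist.public##MyService"
    public String key;
    // The actual payload, e.g. an Instances or Service object
    public T value;
    // Version counter; used to decide whether local data is stale
    public AtomicLong timestamp = new AtomicLong(1L);
}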
Initialization #init()
public void init() throws Exception {
    ...
    // Load metadata from persisted files into memory
    datums = raftStore.loadDatums(notifier);
    setTerm(NumberUtils.toLong(raftStore.loadMeta().getProperty("term"), 0L));
    // Register the Leader election task
    GlobalExecutor.registerMasterElection(new MasterElection());
    // Register the Leader heartbeat task
    GlobalExecutor.registerHeartbeat(new HeartBeat());
    ...
}
Leader Election #MasterElection
In Raft the Leader plays a crucial role, because all write operations must be submitted to the Leader node, so determining a Leader is essential. The init method above registers a periodic election task with an interval of TICK_PERIOD_MS (500 ms). Once a follower has not received a heartbeat within LEADER_TIMEOUT_MS, it assumes the Leader may have failed and starts a new election.
public static final long LEADER_TIMEOUT_MS = TimeUnit.SECONDS.toMillis(15L);
// Leader election due time, initialized to a random value
public volatile long leaderDueMs = RandomUtils.nextLong(0, GlobalExecutor.LEADER_TIMEOUT_MS);
The timeout is a random value between 0 and 15 s precisely to avoid the situation where, the moment the Leader goes down, all followers start elections at the same time (the due time is also re-randomized on every reset; see the sketch after the run() method below).
public void run() {
    try {
        if (!peers.isReady()) {
            return;
        }
        RaftPeer local = peers.local();
        local.leaderDueMs -= GlobalExecutor.TICK_PERIOD_MS;
        if (local.leaderDueMs > 0) {
            return;
        }
        // reset timeout
        local.resetLeaderDue();
        local.resetHeartbeatDue();
        // Start a vote to elect a new Leader
        sendVote();
    } catch (Exception e) {
        Loggers.RAFT.warn("[RAFT] error while master election {}", e);
    }
}
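For reference, the reset calls above re-randomize the due times so that followers do not all time out together. A sketch of RaftPeer's reset methods, assuming the constants from GlobalExecutor in the 1.x source (RANDOM_MS is an additional random window of a few seconds):
// Sketch of RaftPeer's reset methods (assumption: based on the 1.x source)
public void resetLeaderDue() {
    // Base timeout of 15 s plus a random offset (GlobalExecutor.RANDOM_MS)
    leaderDueMs = GlobalExecutor.LEADER_TIMEOUT_MS + RandomUtils.nextLong(0, GlobalExecutor.RANDOM_MS);
}
public void resetHeartbeatDue() {
    // Heartbeats are re-armed on a fixed 5 s interval
    heartbeatDueMs = GlobalExecutor.HEARTBEAT_INTERVAL_MS;
}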
Initiating a Vote #sendVote()
When initiating a vote, the initiator becomes a candidate and always votes for itself. It then sends vote requests to every node except itself and decides the Leader from the aggregated results.
public void sendVote() {
    RaftPeer local = peers.get(NetUtils.localServer());
    Loggers.RAFT.info("leader timeout, start voting,leader: {}, term: {}",
            JSON.toJSONString(getLeader()), local.term);
    peers.reset();
    local.term.incrementAndGet();
    local.voteFor = local.ip;
    local.state = RaftPeer.State.CANDIDATE;
    ...
    for (final String server : peers.allServersWithoutMySelf()) {
        final String url = buildURL(server, API_VOTE);
        try {
            HttpClient.asyncHttpPost(url, null, params, new AsyncCompletionHandler<Integer>() {
                @Override
                public Integer onCompleted(Response response) throws Exception {
                    ...
                    peers.decideLeader(peer);
                    return 0;
                }
            });
        } catch (Exception e) {
            Loggers.RAFT.warn("error while sending vote to server: {}", server);
        }
    }
}
Receiving a Vote #receivedVote()
When a node receives a vote request it applies a first-come-first-served rule: if the candidate's term is not newer than its own (for instance because it has already voted in the current term), it simply replies with the node it currently endorses; otherwise it supports the candidate unconditionally.
public RaftPeer receivedVote(RaftPeer remote) {
    ...
    RaftPeer local = peers.get(NetUtils.localServer());
    // If the candidate's term is not greater than the local term, reply with the node this peer endorses
    if (remote.term.get() <= local.term.get()) {
        String msg = "received illegitimate vote" +
                ", voter-term:" + remote.term + ", votee-term:" + local.term;
        Loggers.RAFT.info(msg);
        if (StringUtils.isEmpty(local.voteFor)) {
            local.voteFor = local.ip;
        }
        return local;
    }
    local.resetLeaderDue();
    // Otherwise support the candidate unconditionally
    local.state = RaftPeer.State.FOLLOWER;
    local.voteFor = remote.ip;
    local.term.set(remote.term.get());
    Loggers.RAFT.info("vote {} as leader, term: {}", remote.ip, remote.term);
    return local;
}
Deciding the Leader #decideLeader()
The majority rule: more than half of all nodes.
public int majorityCount() {
    return peers.size() / 2 + 1;
}
The node whose vote count reaches the majority threshold becomes the Leader; for example, a 5-node cluster requires at least 5 / 2 + 1 = 3 votes.
public RaftPeer decideLeader(RaftPeer candidate) {
    peers.put(candidate.ip, candidate);
    SortedBag ips = new TreeBag();
    int maxApproveCount = 0;
    String maxApprovePeer = null;
    for (RaftPeer peer : peers.values()) {
        if (StringUtils.isEmpty(peer.voteFor)) {
            continue;
        }
        ips.add(peer.voteFor);
        if (ips.getCount(peer.voteFor) > maxApproveCount) {
            maxApproveCount = ips.getCount(peer.voteFor);
            maxApprovePeer = peer.voteFor;
        }
    }
    // The most-voted node becomes Leader only if its vote count reaches the majority
    if (maxApproveCount >= majorityCount()) {
        RaftPeer peer = peers.get(maxApprovePeer);
        peer.state = RaftPeer.State.LEADER;
        if (!Objects.equals(leader, peer)) {
            leader = peer;
            Loggers.RAFT.info("{} has become the LEADER", leader.ip);
        }
    }
    return leader;
}
Leader Heartbeat #HeartBeat
The heartbeat is also a periodic task registered in init(), with the same 500 ms tick interval, but a beat is only actually sent once the heartbeat due time has elapsed.
// Heartbeat interval is 5 s; heartbeatDueMs starts at a random value between 0 and 5 s
public static final long HEARTBEAT_INTERVAL_MS = TimeUnit.SECONDS.toMillis(5L);
public volatile long heartbeatDueMs = RandomUtils.nextLong(0, GlobalExecutor.HEARTBEAT_INTERVAL_MS);
public void run() {
    try {
        if (!peers.isReady()) {
            return;
        }
        RaftPeer local = peers.local();
        local.heartbeatDueMs -= GlobalExecutor.TICK_PERIOD_MS;
        if (local.heartbeatDueMs > 0) {
            return;
        }
        local.resetHeartbeatDue();
        sendBeat();
    } catch (Exception e) {
        Loggers.RAFT.warn("[RAFT] error while sending beat {}", e);
    }
}
Sending a Heartbeat #sendBeat()
First, note that only the Leader can send heartbeats. Sending a heartbeat serves several purposes:
- Maintain the Leader's authority: a heartbeat resets the followers' election timeouts and tells them who the Leader is
- Sync metadata to followers: metadata is written exclusively by the Leader and must be replicated to the followers
- Collect follower state: the Leader needs to know the basic state of each follower
datums is the collection of all metadata. As the code shows, each datum carries a timestamp, which effectively serves as a version number and is the basis for deciding whether data is up to date.
// Original key prefix
public static final String SERVICE_META_KEY_PREFIX = "com.alibaba.nacos.naming.domains.meta.";
public static final String BRIEF_SERVICE_META_KEY_PREFIX = "meta.";
// Abbreviated form, keeping only the "meta." prefix
public static String briefServiceMetaKey(String key) {
    // com.alibaba.nacos.naming.domains.meta.##MyService => meta.##MyService
    return BRIEF_SERVICE_META_KEY_PREFIX + key.split(SERVICE_META_KEY_PREFIX)[1];
}
We can also see that the key is abbreviated here, which reflects the developers' effort to shrink the payload transferred between nodes.
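On the receiving side the abbreviated key must be expanded back into the full key before it can be looked up locally (this is what detailServiceMetaKey does in receivedBeat below). The following is only a simplified sketch of that reverse mapping; the actual KeyBuilder implementation may differ in detail:
// Simplified sketch of the reverse mapping (assumption: not the exact KeyBuilder source)
public static String detailServiceMetaKey(String key) {
    // meta.##MyService => com.alibaba.nacos.naming.domains.meta.##MyService
    return SERVICE_META_KEY_PREFIX + key.substring(BRIEF_SERVICE_META_KEY_PREFIX.length());
}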
public void sendBeat() throws IOException, InterruptedException {
    RaftPeer local = peers.local();
    if (local.state != RaftPeer.State.LEADER && !STANDALONE_MODE) {
        return;
    }
    // Reset the election timeout
    local.resetLeaderDue();
    ...
    if (!switchDomain.isSendBeatOnly()) {
        for (Datum datum : datums.values()) {
            JSONObject element = new JSONObject();
            if (KeyBuilder.matchServiceMetaKey(datum.key)) {
                element.put("key", KeyBuilder.briefServiceMetaKey(datum.key));
            } else if (KeyBuilder.matchInstanceListKey(datum.key)) {
                element.put("key", KeyBuilder.briefInstanceListkey(datum.key));
            }
            element.put("timestamp", datum.timestamp);
            array.add(element);
        }
    } else {
        Loggers.RAFT.info("[RAFT] send beat only.");
    }
    packet.put("datums", array);
    ...
    // Compress the payload with GZIP before sending
    String content = JSON.toJSONString(params);
    GZIPOutputStream gzip = new GZIPOutputStream(out);
    gzip.write(content.getBytes("UTF-8"));
    gzip.close();
    ...
    // Sync metadata (keys and timestamps) to the followers
    for (final String server : peers.allServersWithoutMySelf()) {
        HttpClient.asyncHttpPostLarge(url, null, compressedBytes, new AsyncCompletionHandler<Integer>() {
            @Override
            public Integer onCompleted(Response response) throws Exception {
                // Update the local view with the follower state returned in the response
                peers.update(JSON.parseObject(response.getResponseBody(), RaftPeer.class));
                return 0;
            }
        });
    }
}
Receiving a Heartbeat #receivedBeat()
When a follower receives a heartbeat, it marks itself as a FOLLOWER if it is not already one, resets its own election and heartbeat due times, and re-establishes the IPs and roles of all nodes in the cluster.
It then checks the datum keys carried by the beat: keys whose local copy is missing or stale (by timestamp) are accumulated and, once a batch reaches 50 or the beat has been fully processed, fetched from the Leader in bulk; the fetched values are written into the in-memory datums and persisted. Local keys that do not appear in the Leader's beat are treated as dead keys and removed.
public RaftPeer receivedBeat(JSONObject beat) throws Exception {
    ...
    // If this node is not yet marked as a follower, mark it as one immediately
    if (local.state != RaftPeer.State.FOLLOWER) {
        Loggers.RAFT.info("[RAFT] make remote as leader, remote peer: {}", JSON.toJSONString(remote));
        // mk follower
        local.state = RaftPeer.State.FOLLOWER;
        local.voteFor = remote.ip;
    }
    final JSONArray beatDatums = beat.getJSONArray("datums");
    // Reset the election and heartbeat due times
    local.resetLeaderDue();
    local.resetHeartbeatDue();
    // Re-establish the IPs and roles of all nodes in the cluster
    peers.makeLeader(remote);
    Map<String, Integer> receivedKeysMap = new HashMap<String, Integer>(datums.size());
    for (Map.Entry<String, Datum> entry : datums.entrySet()) {
        receivedKeysMap.put(entry.getKey(), 0);
    }
    // now check datums
    List<String> batch = new ArrayList<String>();
    if (!switchDomain.isSendBeatOnly()) {
        int processedCount = 0;
        for (Object object : beatDatums) {
            processedCount = processedCount + 1;
            JSONObject entry = (JSONObject) object;
            String key = entry.getString("key");
            final String datumKey;
            if (KeyBuilder.matchServiceMetaKey(key)) {
                datumKey = KeyBuilder.detailServiceMetaKey(key);
            } else if (KeyBuilder.matchInstanceListKey(key)) {
                datumKey = KeyBuilder.detailInstanceListkey(key);
            } else {
                // ignore corrupted key:
                continue;
            }
            long timestamp = entry.getLong("timestamp");
            receivedKeysMap.put(datumKey, 1);
            try {
                // Use the timestamp to check whether the local datum is still up to date
                if (datums.containsKey(datumKey) && datums.get(datumKey).timestamp.get() >= timestamp && processedCount < beatDatums.size()) {
                    continue;
                }
                if (!(datums.containsKey(datumKey) && datums.get(datumKey).timestamp.get() >= timestamp)) {
                    batch.add(datumKey);
                }
                // Fetch in batches of 50
                if (batch.size() < 50 && processedCount < beatDatums.size()) {
                    continue;
                }
                String keys = StringUtils.join(batch, ",");
                if (batch.size() <= 0) {
                    continue;
                }
                // Fetch the actual datum values by key from the Leader, sync them into local datums, and persist them
                String url = buildURL(remote.ip, API_GET) + "?keys=" + URLEncoder.encode(keys, "UTF-8");
                HttpClient.asyncHttpGet(url, null, null, new AsyncCompletionHandler<Integer>() {
                    @Override
                    public Integer onCompleted(Response response) throws Exception {
                        ...
                        for (JSONObject datumJson : datumList) {
                            OPERATE_LOCK.lock();
                            Datum newDatum = null;
                            try {
                                Datum oldDatum = getDatum(datumJson.getString("key"));
                                ...
                                // Persist the metadata
                                raftStore.write(newDatum);
                                datums.put(newDatum.key, newDatum);
                                notifier.addTask(newDatum.key, ApplyAction.CHANGE);
                                local.resetLeaderDue();
                                // Try to bring the local term in line with the Leader's term
                                if (local.term.get() + 100 > remote.term.get()) {
                                    getLeader().term.set(remote.term.get());
                                    local.term.set(getLeader().term.get());
                                } else {
                                    local.term.addAndGet(100);
                                }
                                raftStore.updateTerm(local.term.get());
                            } catch (Throwable e) {
                                Loggers.RAFT.error("[RAFT-BEAT] failed to sync datum from leader, datum: {}", newDatum, e);
                            } finally {
                                OPERATE_LOCK.unlock();
                            }
                        }
                    }
                });
                batch.clear();
            } catch (Exception e) {
                Loggers.RAFT.error("[NACOS-RAFT] failed to handle beat entry, key: {}", datumKey);
            }
        }
        ...
        // Remove local datums that no longer exist on the Leader
        for (String deadKey : deadKeys) {
            try {
                deleteDatum(deadKey);
            } catch (Exception e) {
                Loggers.RAFT.error("[NACOS-RAFT] failed to remove entry, key={} {}", deadKey, e);
            }
        }
    }
    return local;
}
Publishing Metadata #signalPublish()
All metadata is written by the Leader and then replicated to the followers. A CountDownLatch is used here: the write is considered successful once a majority of the nodes (the Leader included) respond successfully.
public void signalPublish(String key, Record value) throws Exception {
    if (!isLeader()) {
        JSONObject params = new JSONObject();
        params.put("key", key);
        params.put("value", value);
        Map<String, String> parameters = new HashMap<>(1);
        parameters.put("key", key);
        // Forward the request to the Leader
        raftProxy.proxyPostLarge(getLeader().ip, API_PUB, params.toJSONString(), parameters);
        return;
    }
    try {
        OPERATE_LOCK.lock();
        long start = System.currentTimeMillis();
        final Datum datum = new Datum();
        datum.key = key;
        datum.value = value;
        if (getDatum(key) == null) {
            datum.timestamp.set(1L);
        } else {
            datum.timestamp.set(getDatum(key).timestamp.incrementAndGet());
        }
        JSONObject json = new JSONObject();
        json.put("datum", datum);
        json.put("source", peers.local());
        // The Leader writes the datum locally first
        onPublish(datum, peers.local());
        final String content = JSON.toJSONString(json);
        // The latch count is the majority count (half of the nodes plus one)
        final CountDownLatch latch = new CountDownLatch(peers.majorityCount());
        for (final String server : peers.allServersIncludeMyself()) {
            if (isLeader(server)) {
                latch.countDown();
                continue;
            }
            final String url = buildURL(server, API_ON_PUB);
            HttpClient.asyncHttpPostLarge(url, Arrays.asList("key=" + key), content, new AsyncCompletionHandler<Integer>() {
                @Override
                public Integer onCompleted(Response response) throws Exception {
                    if (response.getStatusCode() != HttpURLConnection.HTTP_OK) {
                        Loggers.RAFT.warn("[RAFT] failed to publish data to peer, datumId={}, peer={}, http code={}",
                                datum.key, server, response.getStatusCode());
                        return 1;
                    }
                    latch.countDown();
                    return 0;
                }
                @Override
                public STATE onContentWriteCompleted() {
                    return STATE.CONTINUE;
                }
            });
        }
        // If a majority has not responded successfully within the timeout, throw an exception
        if (!latch.await(UtilsAndCommons.RAFT_PUBLISH_TIMEOUT, TimeUnit.MILLISECONDS)) {
            // only majority servers return success can we consider this update success
            Loggers.RAFT.error("data publish failed, caused failed to notify majority, key={}", key);
            throw new IllegalStateException("data publish failed, caused failed to notify majority, key=" + key);
        }
        long end = System.currentTimeMillis();
        Loggers.RAFT.info("signalPublish cost {} ms, key: {}", (end - start), key);
    } finally {
        OPERATE_LOCK.unlock();
    }
}
Metadata Acknowledgment #onPublish()
When a follower receives published metadata it syncs it into memory, but only persists it to disk when the key is marked as persistent.
public static final String INSTANCE_LIST_KEY_PREFIX = "com.alibaba.nacos.naming.iplist.";
public static boolean matchEphemeralInstanceListKey(String key) {
    return key.startsWith(INSTANCE_LIST_KEY_PREFIX + EPHEMERAL_KEY_PREFIX);
}
public static boolean matchPersistentKey(String key) {
    return !matchEphemeralKey(key);
}
From this we can see that only instance-list keys with the ephemeral prefix are non-persistent; other data, such as service metadata, always needs to be persisted.
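For completeness, matchEphemeralKey ties the persistence check above to the ephemeral instance-list check; a sketch based on the 1.x KeyBuilder:
// Sketch based on the 1.x KeyBuilder: currently only the instance list has an ephemeral variant
public static boolean matchEphemeralKey(String key) {
    return matchEphemeralInstanceListKey(key);
}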
The Leader then increments its term, and followers sync their term from the Leader (this also differs from standard Raft, which only updates the term during elections).
public void onPublish(Datum datum, RaftPeer source) throws Exception {
    ...
    // Reject the publish outright if the source node's term is lower than the locally recorded term
    if (source.term.get() < local.term.get()) {
        Loggers.RAFT.warn("out of date publish, pub-term: {}, cur-term: {}",
                JSON.toJSONString(source), JSON.toJSONString(local));
        throw new IllegalStateException("out of date publish, pub-term:"
                + source.term.get() + ", cur-term: " + local.term.get());
    }
    local.resetLeaderDue();
    // if data should be persistent, usually this is always true:
    if (KeyBuilder.matchPersistentKey(datum.key)) {
        raftStore.write(datum);
    }
    datums.put(datum.key, datum);
    // The Leader bumps its own term; followers sync their term from the Leader
    if (isLeader()) {
        local.term.addAndGet(PUBLISH_TERM_INCREASE_COUNT);
    } else {
        if (local.term.get() + PUBLISH_TERM_INCREASE_COUNT > source.term.get()) {
            //set leader term:
            getLeader().term.set(source.term.get());
            local.term.set(getLeader().term.get());
        } else {
            local.term.addAndGet(PUBLISH_TERM_INCREASE_COUNT);
        }
    }
    raftStore.updateTerm(local.term.get());
    notifier.addTask(datum.key, ApplyAction.CHANGE);
    Loggers.RAFT.info("data added/updated, key={}, term={}", datum.key, local.term);
}
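The term increment used here corresponds to the literal 100 seen earlier in receivedBeat; it is defined as a constant in RaftCore (the value 100 below is inferred from that literal, not copied from the source):
// Term increase step applied on each publish; matches the literal 100 used in receivedBeat
public static final int PUBLISH_TERM_INCREASE_COUNT = 100;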
Summary
Nacos 1.x provides a simplified take on Raft rather than a strict implementation of the standard algorithm. In Nacos 2.x this implementation was deprecated in favor of Ant Group's general-purpose JRaft.