Self healing open #7189

cjen1-msft · 2025-08-15T10:15:55Z

This PR is the reification of: #7003

The idea is that if a service has fully crashed, it should be able to heal itself so long as it isn't too damaged.
Specifically, the restarting nodes should gossip the knowledge they have locally to try to elect the replica with the best local state.

The result is that after the self-healing-open, one replica is chosen to recover and open, while all others restart and then join it.

The protocol can be roughly summarised as:

1. Start up.
2. Gossip your state (claimed length of ledger) and authenticate the other replicas' attestations and identities.
3. Once you have heard from everyone you expect to hear from, vote for the node with the longest ledger (ties broken by identity).
4. If you receive votes from a majority of your expected cluster, transition_to_open and broadcast IAmOpen to the other nodes.
5. If you receive IAmOpen from a trusted node, restart and join it.
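
The voting and opening rules in the steps above can be sketched as follows. This is a minimal Python illustration, not the PR's C++ implementation. The names `Gossip`, `choose_candidate` and `has_majority` are hypothetical, and the tie-break direction (the greatest identity wins) is an assumption, since the description only says that ties are broken by identity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gossip:
    """What a restarting node claims about its local state (hypothetical shape)."""

    node_id: str  # identity of the gossiping replica
    ledger_length: int  # claimed length of its local ledger


def choose_candidate(gossips):
    """Pick the node to vote for: longest ledger, ties broken by identity.

    Assumption: the greatest node_id wins a tie; the PR description does not
    say which direction the tie-break goes.
    """
    best = max(gossips, key=lambda g: (g.ledger_length, g.node_id))
    return best.node_id


def has_majority(votes_received, cluster_size):
    """True once strictly more than half of the expected cluster has voted."""
    return votes_received > cluster_size // 2


gossips = [Gossip("node-a", 120), Gossip("node-b", 150), Gossip("node-c", 150)]
candidate = choose_candidate(gossips)  # node-b and node-c tie on length
```

Because every node applies the same deterministic rule to the same set of gossips, all correct nodes that have heard from everyone vote for the same candidate.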

This still requires the submission of ledger recovery shares; however, if local sealing is available, it can be used instead.

UPDATE:
To be specific about when and where things happen:

After public ledger recovery, during the create RPC, the node resets the self-healing-open state. Once that state is committed, it starts timers: one to resend gossip/vote/IAmOpen messages, and one to advance the failover state machine.
All messages include an attestation that is validated on receipt. Each handler makes a change to the KV state and then tries to advance the protocol.
Finally, once the protocol is complete, either:
- The chosen replica runs transition-to-open and, on the next timer tick, stops the timers.
- The others shut down, to be restarted later by the host's orchestrator, at which point they try to join the network, landing on the chosen replica.
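
The handler pattern described above (record the message, then try to advance) can be sketched as a small state machine. This is an illustrative Python model only: the phase names and the `SelfHealingOpen` class are hypothetical, the state is held in memory rather than in the KV store, and attestation checks, timers and message resends are left out.

```python
from enum import Enum, auto


class Phase(Enum):
    GOSSIPING = auto()  # collecting ledger-length claims
    VOTING = auto()  # heard from everyone, waiting for votes
    OPENING = auto()  # this node was chosen: run transition-to-open
    JOINING = auto()  # another node opened: shut down and rejoin on restart


class SelfHealingOpen:
    def __init__(self, node_id, expected_nodes):
        self.node_id = node_id
        self.expected = set(expected_nodes)  # assumed to include this node
        self.gossips = {}  # sender id -> claimed ledger length
        self.votes = set()  # ids of nodes that voted for this node
        self.phase = Phase.GOSSIPING

    # Each handler records the message, then tries to advance the protocol.
    def on_gossip(self, sender, ledger_length):
        self.gossips[sender] = ledger_length
        self.advance()

    def on_vote(self, sender):
        self.votes.add(sender)
        self.advance()

    def on_iamopen(self, sender):
        # Assumes the sender's attestation has already been validated.
        if self.phase is not Phase.OPENING:
            self.phase = Phase.JOINING

    def advance(self):
        # In the PR this is also driven by a timer; here it is only called
        # from the handlers.
        if self.phase is Phase.GOSSIPING and self.expected.issubset(self.gossips):
            self.phase = Phase.VOTING
        if self.phase is Phase.VOTING and len(self.votes) > len(self.expected) // 2:
            self.phase = Phase.OPENING


cluster = ["n0", "n1", "n2"]
chosen = SelfHealingOpen("n0", cluster)
for sender, length in [("n0", 10), ("n1", 8), ("n2", 9)]:
    chosen.on_gossip(sender, length)
phase_after_gossip = chosen.phase
chosen.on_vote("n0")
chosen.on_vote("n1")

follower = SelfHealingOpen("n1", cluster)
follower.on_iamopen("n0")
```

In this run `chosen` moves to voting once all three gossips have arrived and to opening after two of three votes, while `follower` moves straight to joining when it hears IAmOpen.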

Co-authored-by: Eddy Ashton <[email protected]>

src/node/self_healing_open_impl.cpp

achamayou · 2025-12-18T13:38:03Z

src/node/self_healing_open_impl.cpp

+
+        auto url = fmt::format(
+          "https://{}/{}/self_healing_open/timeout",
+          node_state->config.network.rpc_interfaces.at("primary_rpc_interface")


I think we want this to take place over the same RPC interface we use for snapshot download, joining, etc., which is probably not the primary, as that tends to be exposed publicly. Perhaps that's a configuration option? Perhaps we need to name that interface.

I've updated this so that all self-healing-open messages pass through the same interface, which is set in the config and in theory should be on a private vnet or similar.
In general, having a concept of an internal interface (one used for snapshots, joining, ledgers, etc.) would be useful, as we could manage it centrally, but that is probably a bigger discussion we should have in the new year.
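
As a rough illustration of the change discussed here, the sketch below looks the interface up by a configured name instead of hard-coding "primary_rpc_interface". It is Python rather than the PR's C++, and everything in it is an assumption made for the example: the function name, the `published_address` key, the interface names and the addresses. The second path segment is elided in the quoted diff, so it is passed in as a parameter rather than guessed.

```python
def self_healing_open_url(rpc_interfaces, interface_name, prefix, endpoint):
    """Build a self-healing-open URL on a configured RPC interface.

    rpc_interfaces: mapping of interface name -> settings (hypothetical shape).
    interface_name: name taken from the node's config, not hard-coded.
    prefix: the path segment that is elided in the quoted diff.
    """
    try:
        iface = rpc_interfaces[interface_name]
    except KeyError:
        raise ValueError(f"unknown RPC interface: {interface_name!r}") from None
    address = iface["published_address"]
    return f"https://{address}/{prefix}/self_healing_open/{endpoint}"


# Hypothetical configuration with a public and a private interface.
interfaces = {
    "primary_rpc_interface": {"published_address": "public.example:443"},
    "internal": {"published_address": "10.0.0.4:8443"},
}
url = self_healing_open_url(interfaces, "internal", "node", "timeout")
```

Failing loudly on an unknown interface name keeps a misconfigured node from silently falling back to the publicly exposed interface.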


tests/infra/network.py

achamayou · 2025-12-18T13:46:40Z

tla/disaster-recovery/src/main.rs

+fn check(model: ActorModel<Node, ModelCfg, ()>) {
+    let checker = model
+        .checker()
+        //.symmetry()


As far as I can recall, symmetry can interact badly with the eventually properties.

tla/disaster-recovery/Cargo.toml

doc/operations/recovery.rst

Co-authored-by: Amaury Chamayou <[email protected]>

cjen1-msft and others added 30 commits July 7, 2025 13:18

Initialise request

32d1361

Fix handler

f88d7b5

fiddle with pointers

4458c8b

Fix timeout

cdebe29

Maybe fix issue?

4ea2bb7

refmt

6214b6c

Merge branch 'main' into curlm

a4be0c3

Update

fce77da

fmt

58eb20c

remove static_cast

5b52e3d

Fix url query

b876cca

Add kickstart for curlm and document interaction between libuv and curlm

68aff99

Refactor interface to make checks more careful.

934010f

move to a constructor pattern

c84ba3f

Add missing nullptr check in curl_socket_callback

594f536

Update src/http/curl.h

333c427

Co-authored-by: Eddy Ashton <[email protected]>

Add check and warn of duplicate headers in responses

12edb67

Migrate fetch.h to new interface

14d827b

fix

6c44deb

Pass through config bits for self-heal-open

1fbd015

Update test infra to test self-healing-open

c790f4d

Fix undefined request body and multi-threaded access to curl

b153e53

Runnable checkpoint

fee3559

Config changes

dc6a7ee

Add timeouts

2058180

Fix curl put with empty body issue

f8981ae

Add test for timeouts

963b6c1

Get open working

4d22d82

Get join working (still requires trusting of replacement nodes)

c67f032

Changes to prevent repeated joins

207b142