Skip to content

Conversation

@maxtropets
Copy link
Collaborator

@maxtropets maxtropets commented Dec 23, 2025

✅ Long Test passed

It was noticed during one of the flakey long-test runs with huge commit delays, that the very first endorsement fetched may be not the one we expect (current services endorses previous), but rather previous service endorsing it's predecessor (or self-endorsed).

The original freshness check only checked the local state machine for PartOfNetwork, but this can async with the service updating TX on the follower.

In this case, the follower updates it here, and it can happen that the opening TX hasn't reached it yet.

Ultimately the node fails to validate the chain, as it's not linked to the current service identity, so no data corruption here, but it's easy enough to prevent this, and we're doing so in this PR.

I reproduced this manually (and tested the fix) by hardcoding delays in the source code, but struggled to find out a reasonably cheap and reliable e2e test flow for this, therefore this PR doesn't introduce any.

UPD: another repro: https://github.com/microsoft/CCF/actions/runs/20489719417/job/58879405016?pr=7533

@maxtropets maxtropets self-assigned this Dec 23, 2025
Copilot AI review requested due to automatic review settings December 23, 2025 17:39
@maxtropets maxtropets requested a review from a team as a code owner December 23, 2025 17:39
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a race condition during back-endorsement fetching that can occur when a node's internal state machine advances to "part-of-network" before the service opening transaction has been fully replicated. The fix adds a check for ServiceStatus::OPEN to ensure the service opening transaction is present before fetching endorsements.

Key Changes:

  • Added a status check to wait for ServiceStatus::OPEN before fetching the first endorsement
  • Added explanatory comments describing the race condition scenario
  • Uses existing retry_first_fetch() mechanism to handle the asynchronous state update

@maxtropets maxtropets added run-long-test Run Long Test job and removed run-long-test Run Long Test job labels Dec 24, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant