graydon2: (Default)
graydon2 ([personal profile] graydon2) wrote2025-07-27 11:29 pm

consensus

I've been working in and around consensus systems for a long time now and I still find a lot of the writing on the subject very vexing. The protocols are often very complex and very subtle and it is often very hard to see the forest for the trees.

I decided this weekend to try to write down what I am fairly sure I actually know about consensus algorithms, in a form I could fit on one page. Here it is!

Consensus



We want some computers to agree on a thing, but there are problems:

  1. The speed of light
  2. Computers can stop running
  3. Computers can keep running but be disconnected
  4. Computers can keep running but have bugs and lie


We have some tricks that we have found help, that most consensus algorithms are some combination of:

  1. Promises
  2. Voting
  3. Preemption
  4. Epistemic levels (I know that you know that I know...)
  5. Digital signatures


In particular:

  1. If you promise me you'll do something in the future, and put that in an envelope, then even if I don't get the message for a while because of the speed of light I will know at least a little about your current instantaneous state when I get the message: that you're doing, or about to do, or have done the thing you promised (unless you crashed or lied). It narrows the space of possible futures: if you're an honest node, you won't be doing some other thing than what you promised. Faster-than-light communication!

  2. If we agree not to do a thing unless there's a majority passing a vote then:

    1. We can pick a fault probability and then size the number of nodes to ensure that vote-majorities are achievable even with the expected worst case number of simultaneous faults, which handles most cases of nodes stopping.

    2. We know there can't be two different things decided by a majority because there can only be one majority, which helps with some bugs (though not lies) and races due to the speed of light.

    3. We know that if there's a network partition then only one side can possibly get to a majority, and the other can only get a minority and never observe the majority, which transforms the nodes in the minority into a special case of "stopping".


  3. If we agree on a mechanism where a vote initiated by one node can be preempted by another node initiating a new vote, then we can handle any case where the initiating node itself fails or is isolated in a partition or is slow or whatever.

  4. If you tell me not just what you know about yourself but also what you know about other nodes, or even what those nodes told you about what they know about other nodes (ideally conveyed with digital signatures so I know you aren't lying about what others said to you), then:

    1. I can ask you to tell me both your vote and, later, what you saw as the outcome of everyone else's voting. This further limits the set of possible futures of the system: once I know a majority say that a majority voted for X then the decision for X is no longer subject to being lost or reversed due to even massive node stoppage or network failure. Any one node in that majority has enough evidence to convince anyone else that X is the majority decision. A node can only decide not-X after that point if someone was buggy or lying.

    2. I (or other correct-and-honest nodes) can compare those observations and catch any bugs-or-liars, at least up to some given scale of collusion, because some buggy-or-lying node L would by definition be making statements inconsistent with the protocol (eg. "I vote X" as part of a majority, but later "I saw the majority including myself vote Y"). And this "say X and also Y" inconsistency will be caught. If L sends this pair of inconsistent messages directly to some correct-and-honest node M, then M immediately knows L is buggy or lying. If L sends the X message to M and the Y message to some other correct-and-honest node N, it still gets caught when M and N exchange their statements-received-from-L. And this covers both intentional lies and bugs -- though an intentional liar knowing this part of the protocol exists will not bother to lie because they know it will be caught, so it only leaves bugs.




I think that's it! There's also some more stuff about livelocks, timeouts, leadership, fast paths, censorship resistance, fairness, pipelining and information dependencies that all seem like .. eh .. details? Fine tuning? Perhaps just "easier to digest in small pieces if you can keep the big picture in mind". Also much more variable between different protocol families.

Anyway I hope this is useful to someone besides myself. At least I can print it out and stick it on my wall when I am next trying to decide what makes a give protocol safe. Also if you think I've either overlooked one of the main big picture elements and/or gotten one of them wrong, please let me know!