By Ben Linders. Article originally posted on InfoQ.
Failure in distributed systems is normal. A distributed system can provide only two of the three guarantees of consistency, availability, and partition tolerance. According to Kevlin Henney, this limits how much you can know about how a distributed system will behave. He gave the keynote Six Impossible Things at QCon London 2022 and at QCon Plus, May 10-20, 2022.
When you look at a piece of code, you can see simple control flow, such as sequence and branching, reflected in the constructs and indentation of the source. What is not visible is how time is lost or reordered in the network or the backend, Henney argued.
Because failure – of routers, cabling, hosts, middleware, etc. – is normal, any view of state that relies on having everything available all of the time is impossible. Henney gave an example of what can happen:
If a node crashes and never comes back up, but you have to wait for it to give the user an answer, the user will never receive an answer. The architect is not in a position to guarantee that the node will become available again. Alternatively, you give the user a cached answer, but that result may be stale. In either case, the user never receives the true answer.
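Henney's scenario maps directly onto a client's read path. The sketch below is a minimal illustration, with hypothetical names throughout: the client either waits for a node that may never answer, or gives up after a timeout and falls back to a cached value that may be stale.

```java
import java.time.Duration;
import java.util.Optional;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch of the choice Henney describes: wait for the node (perhaps forever),
// or time out and answer from a cache that may be stale. Names are illustrative.
public class CachedRead {
    private final ExecutorService pool = Executors.newSingleThreadExecutor();
    private volatile Optional<String> lastKnown = Optional.empty(); // possibly stale

    // Stand-in for a network call to another node; it may hang indefinitely.
    private String fetchFromRemoteNode() throws InterruptedException {
        Thread.sleep(5_000); // simulate a node that is slow, or never answers
        return "fresh value";
    }

    public String read(Duration timeout) {
        Future<String> pending = pool.submit(this::fetchFromRemoteNode);
        try {
            String fresh = pending.get(timeout.toMillis(), TimeUnit.MILLISECONDS);
            lastKnown = Optional.of(fresh);
            return fresh;                                   // up to date, but we had to wait
        } catch (TimeoutException e) {
            pending.cancel(true);
            return lastKnown.orElse("no answer available"); // available, but possibly stale
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return lastKnown.orElse("no answer available");
        } catch (ExecutionException e) {
            return lastKnown.orElse("no answer available"); // the remote call failed outright
        }
    }

    public static void main(String[] args) {
        CachedRead reader = new CachedRead();
        System.out.println(reader.read(Duration.ofMillis(500))); // times out, serves the fallback
        reader.pool.shutdownNow();
    }
}
```

Neither branch yields the true answer; the timeout value and the caching policy are decisions about which imperfect answer is acceptable.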
The CAP theorem is the Heisenberg uncertainty principle of distributed systems: a system can offer consistency, availability, and partition tolerance, but not all three at once, as Henney explained:
For consistency, returned results are either up-to-date and consistent, or an error is received. For availability, there are no error responses, but there is no guarantee the result is up-to-date and consistent. Partition tolerance is the ability of a distributed system to keep functioning when parts of the network become inaccessible. Pick two, you can’t have them all – or, put another way, you can’t know it all.
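The pick-two trade-off shows up concretely in a replica's read path. The following is a minimal sketch, assuming a hypothetical Replica whose failure detector sets a partitioned flag: during a partition, a consistent read refuses to answer, while an available read serves the local, possibly stale, copy.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical replica illustrating the CAP trade-off: during a partition, a
// read must choose between consistency (refuse to answer) and availability
// (answer from the local, possibly stale, copy). Partition tolerance itself is
// not optional, because networks do partition.
public class Replica {
    enum ReadMode { CONSISTENT, AVAILABLE }

    private final Map<String, String> local = new ConcurrentHashMap<>();
    private volatile boolean partitioned = false; // set by some failure detector

    public String read(String key, ReadMode mode) {
        if (partitioned && mode == ReadMode.CONSISTENT) {
            // CP choice: we cannot confirm this value is up to date, so say so.
            throw new IllegalStateException("cannot guarantee a consistent read during a partition");
        }
        // AP choice (or no partition): return what we have, which may be stale.
        return local.get(key);
    }

    public static void main(String[] args) {
        Replica replica = new Replica();
        replica.local.put("greeting", "hello");
        replica.partitioned = true;
        System.out.println(replica.read("greeting", ReadMode.AVAILABLE)); // "hello", possibly stale
        try {
            replica.read("greeting", ReadMode.CONSISTENT);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());                           // the CP answer: an error
        }
    }
}
```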
Timeouts and caching play a significant role in coding distributed systems, not because they are workarounds and optimisation tweaks, but because they are fundamentally necessary, Henney concluded.
InfoQ interviewed Kevlin Henney about distributed systems.
InfoQ: Why do people assume that distributed systems are knowable?
Kevlin Henney: Developers typically view the world through their IDE, from the perspective of a single machine. They default to reasoning about a system through the lens of the code that is immediately in front of them. Although they are consciously aware of the network as a source of failure, concurrency, asynchrony and latency, their porthole view does not necessarily cause them to take a step back and appreciate that what they consider situational – a problem in this case – is actually foundational: it arises from the nature of the system.
Although exception handlers might acknowledge failures, they don’t reflect their normality. I’ve seen more than one coding guideline state that “exceptions should be exceptional”. This advice is little more than wordplay – it’s neither helpful nor realistic. There is nothing exceptional about timeouts, disconnects and other errors in a networked environment – these are “business as usual”.
These “little issues” are also not little: they shape – in fact, they are – the fabric from which a distributed system is made.
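One way to reflect that normality in code is to model timeouts and disconnects as ordinary return values rather than exceptions, so callers are forced to handle them. Here is a sketch using Java 21 sealed interfaces and pattern matching, with hypothetical names:

```java
// Sketch of treating network failure as an expected outcome rather than an
// exceptional one: the return type names the "business as usual" cases.
public class Fetching {
    sealed interface FetchResult permits Ok, TimedOut, Disconnected {}
    record Ok(String body) implements FetchResult {}
    record TimedOut() implements FetchResult {}
    record Disconnected(String reason) implements FetchResult {}

    // Callers must handle every case; there is nothing "exceptional" to forget.
    static String describe(FetchResult result) {
        return switch (result) {
            case Ok ok          -> "got: " + ok.body();
            case TimedOut t     -> "timed out; retry or serve cached data";
            case Disconnected d -> "disconnected: " + d.reason();
        };
    }

    public static void main(String[] args) {
        System.out.println(describe(new TimedOut()));
    }
}
```

Because the compiler checks the switch for exhaustiveness, the timeout and disconnect cases cannot be quietly forgotten the way an unhandled exception can.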
InfoQ: How can we deal with the uncertainty that comes with distributed systems?
Henney: First of all, acknowledge that uncertainty exists. Second, acknowledge that failure and incomplete information are a normal state of affairs, not an exceptional one. Third, decide what data quality is appropriate for the application.
The limits to certainty and knowability mean that architects must recognise that state coherence is a fundamental decision and not an implementation detail. Whether data is eventually consistent or synchronously transactional shapes the code, the user experience and many other qualities of a system. If a user makes a change, is that change immediately and necessarily visible to other users? The answer to that question can change the architecture. Being too strict about data quality will either run afoul of the CAP theorem or have unfortunate performance consequences, such as distributed locking.
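To make that concrete, here is a single-process sketch with two in-memory maps standing in for replicas (all names hypothetical): the synchronous update returns only once every replica has applied the change, so any subsequent read sees it; the eventual update acknowledges immediately and replicates in the background, so another user may briefly read the old value.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of how the consistency decision surfaces in an API.
// Two in-memory maps stand in for replicas on different nodes.
public class ProfileStore {
    private final List<Map<String, String>> replicas =
            List.of(new ConcurrentHashMap<>(), new ConcurrentHashMap<>());
    private final ExecutorService replicator = Executors.newSingleThreadExecutor();

    // Consistent: returns only after every replica has the new value
    // (and would have to block or fail if one were unreachable).
    public void updateSynchronously(String user, String name) {
        replicas.forEach(r -> r.put(user, name));
    }

    // Available: acknowledge immediately, replicate in the background;
    // a read elsewhere may see the old value for a while.
    public void updateEventually(String user, String name) {
        replicas.get(0).put(user, name);
        replicator.submit(() -> replicas.get(1).put(user, name));
    }

    public String readFromReplica(int replica, String user) {
        return replicas.get(replica).getOrDefault(user, "<unset>");
    }

    public static void main(String[] args) throws InterruptedException {
        ProfileStore store = new ProfileStore();
        store.updateEventually("alice", "Alice A.");
        System.out.println(store.readFromReplica(1, "alice")); // may still print <unset>
        store.replicator.shutdown();
        store.replicator.awaitTermination(1, TimeUnit.SECONDS);
        System.out.println(store.readFromReplica(1, "alice")); // now "Alice A."
    }
}
```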
These constraints on distribution are why, for example, distributed garbage collection mechanisms, such as Java’s RMI, are probabilistic in their determination of whether an object is no longer needed and can, therefore, be collected. In a single process in a managed language, it is possible to determine as a statement of fact whether an object is referred to or not; in a distributed system, it is not. A leasing mechanism based on timeouts is used to determine whether it is likely that an object is still referred to by remote processes. But this does not guarantee that the determination is correct, only that it is probably correct.
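The leasing idea can be sketched in a few lines; this illustrates the general technique rather than RMI’s actual implementation, and the names are made up:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustration of lease-based distributed garbage collection (the general
// technique, not RMI's actual code): a remote client keeps an object alive by
// renewing its lease; once the lease lapses, the object is assumed (probably,
// but not certainly, correctly) to be unreferenced and can be collected.
public class LeaseTable {
    private final Duration leaseDuration = Duration.ofSeconds(30);
    private final Map<String, Instant> expiries = new ConcurrentHashMap<>();

    // Called by, or on behalf of, a remote client that still holds a reference.
    public void renew(String objectId) {
        expiries.put(objectId, Instant.now().plus(leaseDuration));
    }

    // Probabilistic, not factual: an expired lease probably means no remote
    // reference remains, but a slow or partitioned client could still hold one.
    public boolean probablyUnreferenced(String objectId) {
        Instant expiry = expiries.get(objectId);
        return expiry == null || Instant.now().isAfter(expiry);
    }

    // Run periodically to collect objects whose leases have lapsed.
    public void collectExpired() {
        expiries.entrySet().removeIf(entry -> Instant.now().isAfter(entry.getValue()));
    }

    public static void main(String[] args) {
        LeaseTable leases = new LeaseTable();
        leases.renew("remote-object-42");
        System.out.println(leases.probablyUnreferenced("remote-object-42")); // false: lease is fresh
    }
}
```

An expired lease is evidence, not proof: a renewal may simply have been delayed or lost in the network, which is exactly the uncertainty Henney describes.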