If writing distributed algorithms is hard, proving they work with the many permutations of state in a highly concurrent distributed architecture is next to impossible. To mitigate this risk, typically you would write a comprehensive set of unit and integration tests and couple this with load and system failure testing to determine correctness both of the algorithm and the implementation.
However, testing has its limits. It is difficult to simulate the many variations of system stalling, network latency and failure and to determine the consequences to other system components of resulting back pressure.
In addition, discovering and fixing a flaw in an algorithm that has passed all of your tests but fails in production is far more costly both in time and reputation than identifying the issue at the point off inception. It was with this in mind that I started looking at TLA plus.
TLA plus is a set of tools for specifying and modelling algorithms. It consists of the TLA specification and PlusCal algorithm language and a set of associated tools for writing and modelling. It was designed by Leslie Lamport.
The TLA specification language is primarily based on set theory ZFC to precisely describe the algorithm as a ‘specification’. In addition it uses Temporal Logic to define the liveness of the algorithm. A model can then be applied to the specification to validate and test it before implementation in code.
TLA is particularly good for validating concurrent and distributed algorithms. The model allows the algorithm to be checked both for Safety (does it perform as expected) and Liveness (how optimally does it do it and what happens if parts of the algorithm run slow or fail). It will also identify whether the algorithm can deadlock or stall frequently.
How it works in lay terms
A TLA plus specification consists of a set of statements that change state based on given logic. Each statement can be considered to be atomic. Each set of statements can in turn be run as a set, thus emulating multiple processes.
When running the model against the specification, it will calculate all possible permutations of initial state using a breath first approach. Subsequent results are fed into the next layer until all steps are run. The model will detect deadlocks (unreachable state).
Temporal logic is used to determine when done is done and allows ‘liveness’ to be factored in. That is, it is possible to write specifications where no state change takes place until another event triggers this. This is known as stuttering and is tolerated within limits. An infinitely stuttering specification is by definition deadlock.
Simple example
Consider a simple two state co-ordination problem. Two processes exist that do some work. The work must be done alternatively, that is the sequence of processes executing must always be A..B..A..B..A
. One approach to achieve this is to use a producer / consumer pattern. Expressed in TLA a naïve implementation might look like this:-
EXTENDS Integers, Sequences
Put(s) == Append(s, "widget")
Get(s) == Tail (s)
VARIABLES b, box
vars == << b, box >>
ProcSet == {0} \cup {1}
Init == (* Global variables *)
/\ b = 0
/\ box = <<>>
Producer == /\ b = 0
/\ box' = Put(box)
/\ b' = 1
Consumer == /\ b = 1
/\ box' = Get(box)
/\ b' = 0
Next == Producer \/ Consumer
Spec == /\ Init /\ [][Next]_vars
In simple terms, this defines the start state as a predicate
Init
the next state
Next
and the Temporal formula
Spec
This then allows all possible states of the specification to be determined via the model checker.
Safety – Checking the model
The model checker will calculate all possible states based on the specification i.e.
b = 0
box = << >>
b = 0
box = << widget >>
etc. To validate the model, the tuple box has been used to append state. The assertion is that the length of box cannot exceed 1 at any given time. That is, the length of box can be either 0 or 1 for any given state; otherwise the algorithm does not describe an alternating process.
Running the model checker does indeed validate that this is the case, using the length check as an Invariant check (that is a formula that must be true in every reachable state). However this does not prove that this alternates in every case. The consumer could be called twice (without the Producer being called) which would also be a failure.
There are a number of ways the specification could be amended to detect this, however this is a good time to introduce the temporal logic concept of weak fairness. Weak fairness on a given step implies that the action cannot be enabled forever (therefore other actions would apply). So in the above example we require the Consumer to respond to every Producer step. As such we can replace the Spec line with:-
Spec == /\ Init /\ [][Next]_vars
/\ WF_vars(Consumer)
This defines the Consumer as fair, that is, will always respond to every Producer event. Checking the model again confirms that is its still valid running in this state.
PlusCal
For those of you not from a mathematical background, TLA’s syntax can a first seem a little daunting. To address this, PlusCal was developed. This is a ‘C’ based language that is more developer friendly. PlusCal is first translated into TLA so it gives you an opportunity to pick up the TLA syntax overtime while working with a language that is easier to reason with.
For example the simple consumer / producer algorithm above in PlusCal looks like this:-
--algorithm Alternate {
variable b = 0; box = <<>> ;
process ( Producer = 0 )
{ p1: while ( TRUE )
{ await b = 0 ;
box := Put(box) ;
b := 1
}
}
fair process ( Consumer = 1 )
{ c1: while ( TRUE )
{ await b = 1 ;
box := Get(box) ;
b := 0
}
}
}
Summary
TLA requires a reasonable investment in time to reap rewards. There is a hyper book but it is written in maths parlance which is not intuitive to most software engineers. There are limited examples and help available and the toolset has some quirks and bugs (at the time of writing) you need to work around. PlusCal significantly eases writing TLA specifications but there is still a need to understand the underlying TLA specification generated to ensure no false assumptions are being made, especially around granularity and fairness.
Its limitations should also be understood. It allows algorithms to be ‘tested’ but the algorithm test parameters and assertions still need to be given careful thought. Whilst it will allow the effect of latency / failure to be modelled this has limitations and could not be used to model large systems with many possible feedbacks / back pressure and other stalling loops. In addition, whilst it can give an indication of the complexity of events that lead to failure it cannot, of its own, indicate the probability of such a failure.
Initial findings are positive, however to see how much ROI this would give would require more extensive use on a range of projects. Specifically its ability to identify issues that would not have been uncovered using other methods or to identify where more aggressive optimizations can be made (by more fully understanding the implications of these) would need to be proved.
Further reading
TLA+ Home page – http://research.microsoft.com/en-us/um/people/lamport/tla/tla.html
TLA+ Tool set – http://research.microsoft.com/en-us/um/people/lamport/tla/tools.html
PlusCal language manual – http://research.microsoft.com/en-us/um/people/lamport/tla/c-manual.pdf