dinner, drinks and distributed snapshots
in an attempt to understand my coursework better, i plan on writing about things i'm learning -> starting with distributed snapshots
introduction
in my distributed systems class at uiuc, we recently learnt about the chandy-lamport algorithm - a snapshot algorithm that attempts to capture the global state of a distributed system
the story behind how they came up with the algortihm is super fun, and reason behind the title of this post. in lamport's own words -
The distributed snapshot algorithm described here came about when I visited Chandy, who was then at the University of Texas in Austin. He posed the problem to me over dinner, but we had both had too much wine to think about it right then. The next morning, in the shower, I came up with the solution. When I arrived at Chandy's office, he was waiting for me with the same solution.
what even is a global snapshot?
a global snapshot is simply the captured state of every process in a distributed system along with the state of each possible communication channel
by state of a process, we mean things like heap, program counter, registers, core dump, etc. and by state of each possible communication channel we mean capturing any messages between all possible pairs of processes in a system
why do we need it?
snapshots provide a consistent view of the state of a distributed system, enabling things like -
- checkpointing
- debugging
- garbage collection
sounds easy?
in a centralized system snapshots are simple-
every 'x' seconds, dump the state of the system and you're good to go
in a distributed system unpredictable latency and the lack of a single global time source make it impossible to sync clocks across any processes over a network
thus, we have no way to tell every process to take a snapshot at the exacy same time
defining the system model
before we move on to see what the hungover pair of chandy and lamport cooked up, it would be good to define the state of the system
proceses in the system are called
Pi, Pj
, etc.channels between two proceses
Pi, Pj
areCij and Cji
that is, there are 2 channels per pair of processes -
messages from
Pi
-->Pj
are on channelCij
messages from
Pj
-->Pi
are on channelCji