dinner, drinks and distributed snapshots


in an attempt to understand my coursework better, i plan on writing about things i'm learning -> starting with distributed snapshots



introduction


in my distributed systems class at uiuc, we recently learnt about the chandy-lamport algorithm - a snapshot algorithm that attempts to capture the global state of a distributed system

the story behind how they came up with the algortihm is super fun, and reason behind the title of this post. in lamport's own words -

The distributed snapshot algorithm described here came about when I visited Chandy, who was then at the University of Texas in Austin. He posed the problem to me over dinner, but we had both had too much wine to think about it right then. The next morning, in the shower, I came up with the solution. When I arrived at Chandy's office, he was waiting for me with the same solution.


what even is a global snapshot?


a global snapshot is simply the captured state of every process in a distributed system along with the state of each possible communication channel
by state of a process, we mean things like heap, program counter, registers, core dump, etc. and by state of each possible communication channel we mean capturing any messages between all possible pairs of processes in a system


why do we need it?


snapshots provide a consistent view of the state of a distributed system, enabling things like -

  1. checkpointing
  2. debugging
  3. garbage collection



sounds easy?


in a centralized system snapshots are simple-
every 'x' seconds, dump the state of the system and you're good to go

in a distributed system unpredictable latency and the lack of a single global time source make it impossible to sync clocks across any processes over a network
thus, we have no way to tell every process to take a snapshot at the exacy same time

defining the system model


before we move on to see what the hungover pair of chandy and lamport cooked up, it would be good to define the state of the system

proceses in the system are called

Pi, Pj

, etc.
channels between two proceses

Pi, Pj

are

Cij and Cji



that is, there are 2 channels per pair of processes -
messages from

Pi

-->

Pj

are on channel

Cij


messages from

Pj

-->

Pi

are on channel

Cji




chandy-lamport algorithm (finally)

...
;