1. Distributed Systems: Concepts and Design, 5th Edition (2011) by George Coulouris, Jean Dollimore, Tim Kindberg, Gordon Blair; Publisher: Addison-Wesley; ISBN: 978-0132143011.
2. Distributed Systems: An Algorithmic Approach, 1st Edition (2006) by Sukumar Ghosh; Publisher: Chapman and Hall/CRC; ISBN: 978-1584885641.
3. Introduction to Reliable and Secure Distributed Programming, 2nd Edition (2011) by Christian Cachin, Rachid Guerraoui, Luis Rodrigues; Publisher: Springer; ISBN: 978-3642152610.
Metrics for availability & performance. SLAs.
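As an illustration of how availability metrics relate to SLA targets, the following minimal sketch converts an availability fraction into a downtime budget and estimates steady-state availability from MTBF and MTTR. The function names are hypothetical and chosen for this example only; the formula availability = MTBF / (MTBF + MTTR) is the standard steady-state approximation.

```python
def allowed_downtime_minutes(availability: float, period_days: float = 365.0) -> float:
    """Downtime budget in minutes for an availability target over a period.

    `availability` is a fraction, e.g. 0.999 for "three nines".
    """
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    minutes_in_period = period_days * 24 * 60
    return (1.0 - availability) * minutes_in_period


def availability_from_mtbf(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time between failures and mean time to repair."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mtbf_hours must be positive and mttr_hours non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)
```

Under these assumptions, a 99.9% target leaves roughly 525.6 minutes (about 8.76 hours) of downtime per year, and a component with an MTBF of 999 hours and an MTTR of 1 hour is available about 99.9% of the time.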
A brief review of two important theorems in distributed systems, FLP and CAP, and what they mean for real systems.
Linearizability, Consistency, Serializability – Types of consistency: Weak/Eventual/Strong/Causal/FIFO consistency. Lock protocols vs. leases. Problems of transactional systems. Highly available transactions between data centers.
Uses of real and logical time for conflict resolution and maintaining causal histories. Time: real and vector clocks, version vectors. Facebook’s Cassandra database. Flexible consistency. Google TrueTime. Real clock time using atomic clocks & interval arithmetic.
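To make the role of vector clocks in tracking causal histories concrete, here is a minimal sketch, assuming clocks are represented as dictionaries from node id to counter. The names `increment`, `merge`, and `compare` are hypothetical helpers for this example and do not correspond to any particular database's API.

```python
from typing import Dict

VectorClock = Dict[str, int]


def increment(clock: VectorClock, node: str) -> VectorClock:
    """Return a copy of `clock` with `node`'s counter advanced by one (a local event)."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def merge(a: VectorClock, b: VectorClock) -> VectorClock:
    """Element-wise maximum, applied when a node receives another node's clock."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}


def compare(a: VectorClock, b: VectorClock) -> str:
    """Classify the causal relation between two clocks.

    Returns "equal", "before" (a happened before b), "after", or "concurrent".
    Missing entries are treated as zero.
    """
    nodes = a.keys() | b.keys()
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"
```

Two updates whose clocks compare as "concurrent" are the ones that need conflict resolution; when one clock is "before" the other, the later version can safely supersede the earlier one.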
Fault Tolerance Patterns – A taxonomy of patterns (Architectural, Failure Detection, Error Recovery, Error Mitigation, and Fault Treatment).
Study scenarios that are used at scale. Taxonomy, differences between database replication and distributed system replication. Transactional replication vs. state machine replication. Primary/backup. Sync/Async and atomicity guarantees. Case studies: Riak, Microsoft Azure, Amazon SimpleDB, Google Datastore, Sinfonia.
Preventing data divergence
Consensus algorithms and state machine replication. Basic Paxos: its issues and the variety of implementations. The Raft consensus algorithm, designed to be easier to understand and implement than Paxos. Chain replication.
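Both Paxos and Raft rely on majority quorums, so a short worked example of the arithmetic may help. The sketch below assumes crash (non-Byzantine) failures and uses hypothetical function names; it shows only the quorum calculation, not either protocol.

```python
def majority_quorum(cluster_size: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    return cluster_size // 2 + 1


def tolerated_crash_failures(cluster_size: int) -> int:
    """Number of nodes that can crash while a majority remains reachable."""
    return cluster_size - majority_quorum(cluster_size)


def is_committed(acks: int, cluster_size: int) -> bool:
    """True once an entry has been acknowledged by a majority of the cluster."""
    return acks >= majority_quorum(cluster_size)
```

With five nodes the quorum is three and two crashes are tolerated. Growing the cluster to six raises the quorum to four while still tolerating only two crashes, which is why odd cluster sizes are the usual choice.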
Dynamic membership changes
What defines a cluster, and how services and data are migrated to new machines. This section is about managed membership changes.
A look at how Yahoo and Google coordinate services using low-level coordination services. Case study: Zookeeper. Used by Yahoo, it provides services like publish/subscribe, locks, hierarchical naming, and events on state changes.
The challenges and economics at scale of magnetic disks, SSDs, and in-memory systems. Latency, failure rates, MTBFs, and power analyses.
Monitoring – Tracing and logging at scale