Design
1. Good Design Requirements¶
Scalability
- System should grow with its user base
Maintainability
- Ensure ease of future development and improvements
Efficiency
- Optimal use of resources
Reliability
- System remains stable when things go wrong
1.1. 3 Key Elements of System Design¶
Moving Data
- Ensure data flows seamlessly between components (user, system, databases)
- Optimized for speed and security
Storing Data
- Consider Access patterns, Indexing strategies, Backup solutions
- Data should be readily available and stored securely
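As a small illustration of designing storage around an access pattern (a hedged sketch; the record shape and names are invented): if users are mostly looked up by email, keeping an index keyed on email turns a full scan into a direct lookup, at the cost of extra memory and index maintenance on writes.

```python
# Hypothetical user records; in a real system this would be a table or collection.
users = [
    {"id": 1, "email": "a@example.com", "name": "Ada"},
    {"id": 2, "email": "b@example.com", "name": "Bob"},
]

# Without an index: every lookup scans all records, O(n) per query.
def find_by_email_scan(email):
    return next((u for u in users if u["email"] == email), None)

# With an index built for the known access pattern: O(1) per query,
# at the cost of keeping the index up to date on writes.
email_index = {u["email"]: u for u in users}

def find_by_email_indexed(email):
    return email_index.get(email)

print(find_by_email_scan("b@example.com")["name"])     # Bob
print(find_by_email_indexed("b@example.com")["name"])  # Bob
```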
Transforming Data
- Convert raw data into meaningful information
- Aggregate log files for analysis
- Convert user input into different formats
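A minimal sketch of the log-aggregation transform mentioned above, assuming a made-up space-separated log format (timestamp, method, path, status):

```python
from collections import Counter

def aggregate_request_counts(log_lines):
    """Aggregate raw access-log lines into per-endpoint request counts."""
    counts = Counter()
    for line in log_lines:
        parts = line.split()
        if len(parts) < 4:
            continue  # skip malformed lines
        _, method, path, _ = parts[:4]
        counts[(method, path)] += 1
    return counts

raw_logs = [
    "2024-01-15T10:00:00Z GET /api/users 200",
    "2024-01-15T10:00:01Z GET /api/users 200",
    "2024-01-15T10:00:02Z POST /api/orders 201",
]
print(aggregate_request_counts(raw_logs))
# Counter({('GET', '/api/users'): 2, ('POST', '/api/orders'): 1})
```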
1.2. CAP Theorem (a.k.a. Brewer's Theorem)¶
Assumes a distributed system
Consistency
- Ensures all nodes in the distributed system have the same data at the same time; a change in one node should also be reflected in all nodes
- Strong Consistency: all read and write operations are guaranteed to be immediately consistent with each other, regardless of network delays or failures
Availability
- System is always operational and responsive to requests, regardless of partial node failure
- High Availability: the system remains operational and responds to requests even in the presence of network or other system issues
Partition Tolerance
- The system's ability to continue functioning even if a network partition occurs
- Partition Tolerance: the system continues to operate even when network communication between nodes is unreliable or fails
Can only achieve 2 out of those properties at the same time
CA
- Prioritises both consistency and availability; does not tolerate network partitions and always aims for strong consistency
- Examples: RDBMS
AP
- Prioritises availability and partition tolerance over strict consistency; these systems may provide eventual consistency, where data may take some time to propagate and become consistent across all nodes
- Examples: Cassandra, DynamoDB
CP
- Consistency is prioritised, even at the expense of availability; when a network partition occurs, the system might choose to become temporarily unavailable rather than risk delivering inconsistent data
- Examples: MongoDB, HBase, Redis
It's not about finding the perfect solution; it's about finding the best solution for our specific use case
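To make the AP/eventual-consistency idea concrete, here is a toy in-memory sketch (all class and method names are invented; real stores like Cassandra or DynamoDB do this with far more machinery): writes are accepted by one replica and propagate asynchronously, so a read from a lagging replica can return stale data until replication catches up.

```python
class Replica:
    """One node holding its own copy of the data."""
    def __init__(self, name):
        self.name = name
        self.data = {}

class EventuallyConsistentStore:
    """Toy AP-style store: writes land on one replica and are
    replicated lazily, so reads can be stale for a while."""
    def __init__(self, replicas):
        self.replicas = replicas
        self.pending = []  # queued replication events

    def write(self, key, value):
        # Accept the write on the first replica and stay available;
        # the other replicas are updated later.
        self.replicas[0].data[key] = value
        self.pending.append((key, value))

    def read(self, replica_index, key):
        return self.replicas[replica_index].data.get(key)

    def replicate(self):
        # Simulate the asynchronous replication catching up.
        for key, value in self.pending:
            for replica in self.replicas[1:]:
                replica.data[key] = value
        self.pending.clear()

store = EventuallyConsistentStore([Replica("a"), Replica("b")])
store.write("user:1", "alice")
print(store.read(1, "user:1"))  # None - replica "b" hasn't caught up yet
store.replicate()
print(store.read(1, "user:1"))  # alice - now consistent
```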
1.3. Availability¶
Usually expressed as a percentage of uptime, e.g. "five nines" (99.999%)
Example
Running a service with 99.9% availability allows for 8.76 hours of downtime a year
365 * 24 * 0.001 = 8.76 hours
99.999% availability only allows about 5 minutes of downtime a year
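The same arithmetic, as a small helper that converts an availability target into an annual downtime budget (the function name is made up for illustration):

```python
def annual_downtime_hours(availability_pct):
    """Hours of allowed downtime per (non-leap) year for a given availability."""
    return 365 * 24 * (1 - availability_pct / 100)

print(annual_downtime_hours(99.9))    # 8.76 hours
print(annual_downtime_hours(99.999))  # ~0.09 hours, i.e. roughly 5 minutes
```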
SLO - Service Level Objectives
Example: we can set an objective that our service should respond to requests within 300 ms, 99.9% of the time
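A hedged sketch of how such an objective could be checked against recorded request latencies (names and numbers are illustrative only):

```python
def slo_met(latencies_ms, threshold_ms=300.0, target=0.999):
    """True if at least `target` fraction of requests finished within `threshold_ms`."""
    if not latencies_ms:
        return True  # no traffic, nothing violated
    within = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return within / len(latencies_ms) >= target

# 10,000 fast requests and a single slow outlier: the objective still holds.
samples = [120, 95, 250, 280, 180] * 2000 + [900]
print(slo_met(samples))  # True
```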
SLA - Service Level Agreements
More like formal contracts with users. They define a minimum level of service we're committing to provide. If our availability drops below the level stated in the SLA, we might have to provide refunds.
1.4. Assessing System Resilience¶
We use the following criteria to assess system resilience:
Reliability
- Ensuring system works correctly and consistently
Fault Tolerance
- How the system handles unexpected failures or attacks
Redundancy
- Having a system backup that takes over when part of the system fails
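A minimal sketch of redundancy in practice: try the primary node first, and fail over to a backup when it errors. The node functions and the exception type are assumptions for illustration.

```python
class ServiceUnavailable(Exception):
    pass

def call_with_failover(primary, backups, request):
    """Try the primary first; on failure, fall back to redundant backups."""
    for node in [primary, *backups]:
        try:
            return node(request)
        except ServiceUnavailable:
            continue  # this node is down, try the next one
    raise ServiceUnavailable("all nodes failed")

def broken_node(request):
    raise ServiceUnavailable("primary is down")

def healthy_node(request):
    return f"handled {request}"

print(call_with_failover(broken_node, [healthy_node], "GET /health"))
# handled GET /health
```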
1.5. Speed¶
Throughput
- How much data our system can handle over a certain period of time
- Server throughput is measured in RPS (requests per second)
- Database throughput is measured in QPS (queries per second)
- Data throughput is measured in B/s (bytes per second)
Latency
- How long it takes to handle a single request
Speed Optimization Trade-off
When it comes to optimising speed, improving one metric often hurts another. For example, increasing throughput by batching jobs will increase latency, since each request waits for its batch to fill.
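A rough model of that trade-off (every number below is invented purely to show the shape of the curve): batching amortises a fixed per-batch overhead, so maximum throughput rises with batch size, but each request also waits for its batch to fill, so average latency rises too.

```python
def batching_tradeoff(batch_size, per_item_ms=1.0, per_batch_overhead_ms=10.0,
                      arrival_interval_ms=2.0):
    """Toy model: returns (max throughput in req/s, average latency in ms)."""
    batch_time_ms = per_batch_overhead_ms + batch_size * per_item_ms
    # Overhead is paid once per batch, so bigger batches finish more items per second.
    max_throughput_rps = 1000.0 * batch_size / batch_time_ms
    # A request waits, on average, for half the batch to fill before processing starts.
    avg_wait_ms = (batch_size - 1) / 2 * arrival_interval_ms
    avg_latency_ms = avg_wait_ms + batch_time_ms
    return max_throughput_rps, avg_latency_ms

for size in (1, 8, 64):
    tput, lat = batching_tradeoff(size)
    print(f"batch={size:3d}  max throughput={tput:6.1f} req/s  avg latency={lat:6.1f} ms")
# Larger batches: higher throughput, but higher per-request latency.
```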