Skip to content

Design

1. Good Design Requirements

  • Scalability - System should grow with its user base
  • Maintainability - Ensure ease of future development and improvements
  • Efficiency - Optimal use of resources
  • Reliability - System remains stable when things go wrong

1.1. 3 Key Elements of System Design

⚙ Moving Data

  • Ensure data flows seamlessly between components (user, system, databases)
  • Optimized for speed and security

💾 Storing Data

  • Consider Access patterns, Indexing strategies, Backup solutions
  • Data should be readily available and stored securely

🔃 Transforming Data

  • Convert raw data into meaningful information
  • Aggregate log files for analysis
  • Convert user input into different formats

1.2. CAP Theorem A.K.A Brewer's Theorem

Assumes distributed system

  • Consistency - Ensures all nodes in the distributed system have the same data at the same time. Change in one node should also be reflected in all nodes.
    • Strong Consistency: All read and writes operations are guaranteed to be immediately consistent with each other, regardless of network delays or failures
  • Availability - System is always operational and responsive to requests regardless of partial node failure
    • High Availability: The system remains operational and responds to requests even in presetns of network or other system issues.
  • Partition Tolerance - Systems ability to continue functioning even if a network partition occur
    • Partition Tolerance: The system continues to operate, even when network communications between nodes is unreliable or fails.

Can only achieve 2 out of those properties at the same time

Can only achieve 2 out of those properties at the same time

  • CA - Prioritises both consistency and availability. They do not tolerate network partitions an always aim for strong consistency.
    • RDBMS
  • AP - Prioritise availability and partitioning tolerance over strict consistency. They systems may provide eventual consistency, where data may take some time to propagate and become consistent across all nodes.
    • Cassandra, DynamoDB
  • CP - Consistency is prioritised, even at the expense of availability. When network partition occurs, the system might choose to become temporary unavailable rather than risk delivering inconsistent data
    • MongoDB, HBase, Redis

Not about finding the perfect solution, it's about finding the best solution for our specific use case

1.3. Availability

Usually measured as 99.999%

Example

Running service with 99.9% availability, allows for 8.76 hour of downtime a year 365*24 * 0.001 = 8.76 Hours

99.999% only allows 5 mins of downtime a year

SLO-Service Level Objectives

Example. Can set that our service should respond to the request within 300ms 99.9% of the time

SLA-Service Level Agreements

More like format contracts with the users. They define a minimum level of service we're committing to provide. If your availability drops below our stated availability in the SLA, we might have to provide refunds.

1.4. Assessing System Resilience

We use the following criteria to assess system resilience:

  • 🛡 Reliability
    • Ensuring system works correctly and consistently
  • 💥 Fault Tolerance
    • How does system handle unexpected failures or attacks
  • 🔁 Redundancy
    • Having a system backup that takes over when part of the system fails
  • ⚡ Speed
    • Throughput - How much data our system can handle over a certain period of time
      • Server Throughput is measured in RPS (requests per second)
      • DB Throughput is measured in QPS (Queries per second)
      • Data throughput measured in B/s (Bytes per second)
    • Latency - How long it takes to handle a single request

Speed Optimization Trade-off

When it comes to optimising speed, it often affects other metrics. For example, increasing throughput by batching jobs will decrease latency.