To be sorted
Calculate network bandwidth requirement
How to calculate network bandwidth requirements for a video sharing site like YouTube. Assume average video size is 100MB and 500000 users uploading 1 video per day. 5 million daily active users seeing 5 videos per day.
You will probably not upload all video to 1 DC and 1 host. How many DC and hosts will you use? Let’s assume we need this globally. So 4 availability zones, 2 DC each, 2 hosts per DC
Design YouTube view history
Design YouTube view history. Discuss system architecture, data storage etc.
What is the storage system you would use, why and how?
Sync data for large number of devices
Design of how data is synced when there are large number of devices and data is updating rapidly. Like we wear fitness bands, data is updated with every step in the band and then synced with the global storage.
What is the point of sync that on every step? There are multiple reasons of not to do that
- it's expensive
- the device doesn't have an internet connection all the time
- it consumes a lot of battery
- and probably the main reason: it is not needed to have such an latency on syncing that. it's not a HFT thing
Then how is data synced behind the scenes? What is the architecture? I'd upload the data (whatever precision is required) into a cell phone via bt and just send an butch update to the server. DB depends on requirements, might be wide column, or time series db
Consistent Hashing
Fanout Implementation
What's your thought for fanout implementation - listen change stream + lambda instead of a queue + consumer? consider you have high write throughput. I wanted to understand the limitation of Lambda. I got the answer. Example, timeline update for follower in twitter system. If you want to build listen change stream you need to use a queue internally + consumer.
Lambda
I dont recommend lambda for low latency, HA and high throughput usecases. Lambda cold starts are pretty well knows time consumers and you'd have to account to the latency addition (sometimes in seconds).
You won't have cold start with Provisioned Concurrency. You are gonna pay for it and it's not cheap. basically what PC does is to warm up some additional containers (more containers, more $), ready to serve a request. also say you have 100 containers ready and you get a burst call of 101 requests which makes the last call a cold-start one.
Other References
- https://databricks.com/glossary/lambda-architecture
- https://en.wikipedia.org/wiki/Lambda_architecture
Design Instagram vs Twitter
Can I say that system design for twitter vs instagram is the same except that twitter has 140 char limit? Rest seems to be same. user posts txt, img & video. Fanout to followers with celebrity as special case. Read heavy.. say 10 times of write.. eventual consistency is fine.
- follow part is totaly different in twitter and insta
- Instagram should be media heavy than Twitter
- The way news feed is computed is totally different in both. The main idea of any news feed is fanout.
In write heavy scenario. if we have more reads that writes its better to spend more time in write to prerate data so that read will be cheaper.
Kafka
It is worth noting that typically only one consumer within a consumer group is subscribed to a partition... [1] this is how Kafka achieves high message processing throughput. So, even in the case of multiple partitions, messages within a single partition are truly processed in the order they were sent.
Also worth noting is that partitions are specific to Kafka. For example, Google Cloud Pub/Sub doesn’t expose partitions to the user — they are there, but they are behind the scenes. [2].
In a systems design interview, it is sufficient to just talk about topics and subscribers. There is no need to go deeper than topics or to mention specific products. In fact, I am told by an insider that mentioning specific products is frowned upon at Google, which has its own version of all the open-source projects (and more).
[1] “Each partition is connected to at most one consumer from a group.” https://blog.cloudera.com/scalability-of-kafka-messaging-using-consumer-groups/
[2] “Partitions are not exposed to users.” https://medium.com/google-cloud/google-cloud-pub-sub-ordered-delivery-1e4181f60bc8
- Confluent Build Services on a Backbone of Events
- Confluent Apache Kafka Supports 200K Partitions Per Cluster
- Kafka vs Redis Pub-Sub, Differences which you should know
- YT - Kafka: A Modern Distributed System
Distributed Consensus
To get Redis to be fault tolerant for this use case, you need to give up the idea of a partially synchronous system, or you need to sync up state using transactions, or you need a Redis master node. Basically to have consistent state in Redis is difficult and to get it, you need to give up a lot.
So one answer is a distributed consensus algo, where at least the majority of nodes agree on state. This eliminates a lot of the problems and possible bottlenecks that Redis might have.
But Replicated State machines, configuration stores, leader election, distributed locking are all very good use cases for distribute consensus. Distributed concensus algos, in general, allow you to cheat some of the problems that "normal" concensus patterns might have.
There's no free lunch of course, these algos have serious problems with round trip times. and other concerns. It's not some magic solution.
If that data can't change, it reduces the complexity and need for these algos.
https://sre.google/sre-book/managing-critical-state/
Open Questions
- LLD design for Log4J library
Really Random things
- https://engineering.fb.com/2021/04/05/video-engineering/how-facebook-encodes-your-videos/
- https://blog.dream11engineering.com/building-scalable-real-time-analytics-alerting-and-anomaly-detection-architecture-at-dream11-e20edec91d33 - scaling for dimensionality, aggregated data, metrics after I collect the data. Collecting the data from devices. around 1.2 billions devices a day. Edge computing means at the device, some level of aggregations should happen and collect the data through HTTP through Kafka by topic
- https://medium.com/@jadsarmo/why-we-chose-java-for-our-high-frequency-trading-application-600f7c04da94
- https://netflixtechblog.com/how-netflix-uses-ebpf-flow-logs-at-scale-for-network-insight-e3ea997dca96
- Cloudflare Cloudflare architecture and how BPF eats the world - incredible articles Cloudflare CDN Caching - Dynamic content is generated by scripts that change the content on a page. By running scripts in a CDN cache instead of in a distant origin server, dynamic content can be generated and delivered from a cache. Dynamic content is thus essentially "cached" and does not have to be served all the way from the origin, reducing the response time to client requests and speeding up dynamic webpages. Cloudflare Workers, for example, are serverless JavaScript functions that run on the Cloudflare CDN
- A Closer Look At Etcd: The Brain Of A Kubernetes Clusteretcd will almost always be the bottleneck because everything is stored there by API Server
- AirBnb - Avoiding Double Payments in a Distributed Payments System
- https://logz.io/blog/kafka-vs-redis/
- https://www.educba.com/redis-vs-kafka/
- YT Query Petabyte Scale Dataset on S3
- YT https://www.youtube.com/channel/UCZEfiXy7PmtVTezYUvc4zZw - 14 videos on this YouTube channel. This dude has nice videos, beginner friendly. The databases video summed it up pretty well, i would suggest it to beginners aswell who want a databases overview
- YT MIT 6.824 Distributed Systems (Spring 2020)
- YT Four Distributed Systems Architectural Patterns by Tim Berglund
- YT Designing Udemy's Taxonomy on SQL | System Design
- YT Cancel token
- YT - Airbnb Search Architecture
- YT - Consistent Hashing Rajeev - Basically, nodes are arranged in a ring structure (not actually a ring but conceptually). Each node is responsible for keys hashed between it and its predecessor in the ring. Node placement has nothing to do with hash function used or how keys are hashed- instead it’s based on data, mostly to spread data evenly across all nodes. A virtual node placed in the ring structure would relieve load on its successor.
- YT - why it is very hard to cancel an HTTP request
- YT - AWS re:Invent 2017: Architecting a data lake with Amazon S3, Amazon Kinesis, and Ama (ABD318) - Atlassian built a data lake from S3, Kinesis, Glue, and Athena
- YT - Episode 109: eBay’s Architecture Principles with Randy Shoup - a great podcast by Randy Shoup ( VP Engineering and Chief Architect at Ebay) on ebay Architecture principles
- YT - Streaming a Million Likes/Second: Real-Time Interactions on Live Video
- YT - Watermarks: Time and Progress in Apache Beam and Beyond - about event time ordering
- https://docs.microsoft.com/en-us/azure/architecture/best-practices/api-design - references for API design best practices. Very comprehensive collection of best practices
- https://docs.microsoft.com/en-us/azure/architecture/patterns/index-patterns
- Java Prep
- https://docs.oracle.com/javase/tutorial/
- https://github.com/javadroider/interview-prep/tree/master/interview
- https://netflix.github.io/atlas-docs/overview/
- https://leetcode.com/problems/flatten-a-multilevel-doubly-linked-list/
- https://leetcode.com/problems/median-of-two-sorted-arrays/ - pretty much needs a big trick/2 and even then the edge cases are hard to explain/code in the timeframe, esp without prev practice.
- U can share ur screen with sketchboard.io or digital pad+pencil works well with google jamboard. I simply share my screen on hangouts or zoom and diacuss on sketchboard.io If thats not possible then macbook sidecar with ipad and using this i can write or draw on any digital whiteboard. I have tried with jamboard.
- DDIA Notes
- DDIA notes - keyvanakbary
- DDIA notes - ibillett
- DDIA notes - xx
- DDIA Chapter 1
- https://chiragshah9696.medium.com/interviews-interviews-interviews-7407faf4c7cc
- https://betterprogramming.pub/top-30-apple-coding-interview-questions-with-solutions-19990071ebfc
- MongoDB
- Reddit - cons of mongodb
- HSBC moves to MongoDB
- end to end system design with data generation, data ingestion, data aggregation, and data representation
- XMPP is decentralized whereas WebSockets are centralized. XMPP works on the application layer whereas WebSockets are a transport protocol
- https://aws.amazon.com/blogs/compute/using-amazon-api-gateway-as-a-proxy-for-dynamodb/
- Leaderboard - Building real-time Leaderboard with Redis
- What Is Load Balancing? Types, Configurations, and Best Tools - DNSstuff
- Architecture blogs
- Fraud Detection - at the simplest, it's just a stream of events and some query or model looking over the data
- Developing a flash sale system | HackerNoon
- News Feed Ranking
- Stack Overflow How did WhatsApp achieve 2 million connections per server?
- Stack Overflow Elasticsearch index sharding explanation
- https://microservices.io/
- http://distributedsystemscourse.com/
- Push notifications
- https://developer.apple.com/documentation/usernotifications/setting_up_a_remote_notification_server/pushing_background_updates_to_your_app
- Concurrency & Parallelsim
- Overview of Modern Concurrency and Parallelism Concepts
- Educative Multithreading notes
- Introduction to Thread Pools in Java | Baeldung
- Java Concurrency and Multithreading Tutorial
- REST vs RPC
- gRPC vs REST: Understanding gRPC, OpenAPI and REST and when to use them in API design | Google Cloud Blog
- YT Distributed Systems lecture series
- Coding / OO problems
- Design Tic Tac Toe - establish a contract of a board that an implementer would fulfill
- Snake and Ladder HyperLogLog - probabilistic data structure
- https://leetcode.com/problems/design-search-autocomplete-system/
- https://www.oodesign.com
- https://github.com/iluwatar/java-design-patterns
- Bloom Filter
- Cassandra Bloom Filters
- Ribbon Filter
- https://serverfault.com/questions/238417/are-networks-now-faster-than-disks
- https://thenewstack.io/datadog-monitors-scalable-systems/
- Write a time-series database engine from scratch
- Datadog How to solve 5 Elasticsearch performance and scaling problems
- Optimizing Flipkart’s Serviceability Data from 300 GB to 150 MB in-memory
- Leadership Principles https://www.levels.fyi/blog/amazon-leadership-principles.html
- YT - Amazon Interrview Questions playlist
- https://druid.apache.org/
- https://linkerd.io/
- theoretical-maximum-number-of-open-tcp-connections-that-a-modern-linux
- Exclusive: a behind-the-scenes look at Facebook release engineering - explains a bit on how to copy 38 GB file located in US to all the 1000 Data Center located in China. All Data Center are loaded with work and using all bandwidth might affect the work. (BitTorrent like question)
- Sky Computing, the Next Era After Cloud Computing
- Sequence Generation in Cloud Spanner
- Stack Overflow Distributed Sequence Number Generation
- Redis INCR