Its been a while since my last post. Nevertheless, let’s get started again with new series of posts. In this section, we will begin new journey and see how to design system. Before designing any scalable system, what are the factors, we need to take care of? Let’s consider a scenario, where in we are designing hotstar like video ingestion system. How would we approach this part? What are the basic questions which we need take into account. Let’s look at these scenarios. Before, jumping to discussion, one point to note here that in system design problems, no solution is 100% correct and no solution is 100% wrong. These questions are open ended questions and answers can vary entirely on different scenarios presented. Hence, without wasting time, let’s get started.
High Level Design:-
- Above diagram explains the high level design where in hotstar can receive videos from different vendors.
- Hotstar client can then upload the same to some blob storage say S3.
- Then, after processing of those videos, all videos will be ready for rendering for users from different locations.
MVP (Minimal Viable Product) Requirements:-
- Upload video in any format
- Convert the video in say MP4. (Assume, this system is already there.)
- Quality of video > 1080p
- Generate multiple thumbnails. (Assuming Player will take care of this)
- Check for Policies
- Generate the video into lower resolutions so that 2g, 3g network can adapt to it.
- Support at-least 10M (Million) parallel hits once video is live.
Now, let’s go ahead and look at the scale estimation part.
- Let’s say Hotstar at the moment is having 1 M(Million) videos at the moment and which is subject to grow 25% YOY (Year on Year).
- Let’s assume, the size of the video can vary from 100 Mb to 10 GB for highest resolution. But, let’s consider on an average for 1 Gb. This will come around 1M*1Gb = 1 Pb (Peta byte). This is the storage requirement we have as on today.
- And, this storage size is subject to grow by 25% YOY. Means, it will increase like 1.25, 1.50 Pb …
- Having said that, we can’t store this much data in one instance.
- Also, one important point to consider here, video once made public, it won’t be retired, doesn’t matter how old it is. Unless there comes certain copyright issues.
Now, let’s go ahead and consider and see QPS.
QPS (Query Per Second):-
Based on the problem statement, we have to support 10M parallel reads at least for the same resource. Before understanding this, let’s assume, we have only one instance which is serving 10 M parallel requests, then what will happen?
- Its request (network) will choke.
- Harddisk will crash.
Now, before delving any further into QPS, let’s first have a high level look at most probable trade off for this.
Low Latency:- Since, we are making video ingestion system and latency is coming into picture when Hotstar user is uploading video to server, hence we can ignore latency here as this is not something which end users are going to experience. Few videos, can take milliseconds to upload, few can go upto hour. Therefore, its fine to ignore latency part here.
CAP:- Since, we are OK with latency in the system, hence based on CAP theorem, we are actually building Available system. Or we can further say that, this consistency can eventually become consistent over the period of time. Here, is one important point to note, our system is Available and Consistent both because we don’t have any partitions. Since, we have 1M videos at the moment and we are storing the location of video as metadata in our RDS (Relational database system), hence we can live without sharding hence, no partioning required.
Say, one video location is taking 1kb space.
Therefore, 1M*1Kb = 1Gb space for one year.
which will grow like 1.2, 1.4 Gb YOY.
Hence, based on above calculation one instance is more than enough to store this data. Therefore, our system is CA (Consistent and Available) till now.
Single point of failure (SPOF):-
When, we don’t have any partition in the system, then it will always have a threat of single point of failure. In order to get rid off SPOF, we can simply enable automated backups either snapshot model(async) or always update model (sync). Here, as soon as we will enable replication, we will be introducing multiple partitions. There are two ways of doing replication
- Snapshot Model:- Every 5 minutes take backup. This is aka async model. This kind of model is usually applied where in want always available model.
- Always Update Model:- Also known as in-sync as slave is always in sync with master. Hence, in case if master becomes available, then slave can be promoted as master. This kind of option, is usually applied in always consistent model.
Since, we need to make our system, always available, hence, we will go with first option, which also means now our system is AP (Available-Partition Tolerant).
Multiple Failure Scenarios:-
Let’s say process of ingestion involves multiple steps say s1 –> s2 –> s3 –> s4 –>s5. After completing these steps, it will be save in let’s say it will be available for public. Therefore, any of the steps from s1 to s5 can fail. On a daily basis 0/1 video can be uploaded on the platform, since this is not very frequent activity. But, let’s say any new series is getting released of 100 episodes, which needs to be uploaded as well.
Assume, any developer has written script to upload them all in one shot. But, let’s say hotstar server accepts only 10 videos at time, then 90 odd videos upload will fail in this case. Therefore, in order to handle this scenario, all these files can be added in the Queue. Once server done processing other videos, it can pick the queued ones.
- Here, first we will upload the video to S3.
- Next, take the s3 url and then update the same in the database.
- Next, take the job id which got created from the database and put the same in the queue.
- Then, one by one job Id, will be picked up dedicated servers and do the next processing.
Hence, above processing like upload video to s3 –> update the video url in the database –> putting the job id in queue. These three steps are sequential steps.Therefore, for queue, we can use Redis Queue or Kafka here.
Above explained processing is one way of deigning the system. There may be other good solutions to this problem like via workflow implementation. That will see in the next discussion. Till then stay tuned and Happy Coding.
Also published on Medium.