One of the most common recurring problems in scaling your servers is deciding how many servers or how many cores you will need. Do I really need this many servers or is there something wrong with the way I implemented the server Architecture? Remember how flipkart.com failed to scale their servers. Horizontal vs Vertical vs Hybrid scaling is another important consideration. Is there a quantitative way you can make this decision? In this article I have tried to represent a mathematical model to fit into a web service request-response queuing mechanism which would answer these questions. I have used the Erlang–C model to compute the queue length and waiting times, which in turn can be used to fit the right number of servers required to serve a certain number of users. Before I pour a reservoir of details down your throat I will give you an insight into the various parameters that are involved. Note, I have curtailed all the details in arriving at the final results in order to keep the post short.
Mean request rate: λ: This is really dependent on the number of users and it very much follows a poisson distribution of request arrival (let’s say number of requests per hour).
Mean Service rate: µ: This is the number of service requests that your system can deliver. It really depends upon how long a service request takes. Imagine a typical search request might take O(nlogn )complexity and depending on the number of entries(n), you can compute the mean service time (this also has a certain O(n) time factor which might be directly dependent on the bandwidth costs which might be a major factor in some cases). A better way to measure this is by running a bunch of performance tests on your server to directly measure the mean response time and completely get rid of the bandwidth criteria.
Number of servers: S: For horizontal scaling. Let’s call S the horizontal scaling factor and µ the vertical scaling factor.
Pn → Probability of n people waiting at any time
L → Expected wait queue Length in the same units as that of Λ
W → Expected wait time in the same units as the frequency of W = L / Λ
We can arrive at the probability that no queue is maintained as:
Observant readers might have already smelled a bivariate Poisson distribution in: λ, µ and we can generalize it to n :
Now the expected waiting queue length can be derived from the first discrete expectation of n over the probability distribution.
The computation of P0 is an O(nlogn) algorithm, on the other hand to fit the right number of servers to reduce the value of L down to a given required value would be a direct binary search In O(logn) time. Overall to find out the number of servers required to efficiently run your state system could be computed in O(nlog2 (n) ) time.