I am going to start a new term: The Yogscast Effect (in memory of The Slashdot Effect). Read on to find out why.
The last few hours I have been poking around our services. The lovely thing about a heavy load is that you can finally try out those things you always wanted to try out, and see how the traffic flows. Sadly it also means sometimes you do something wrong.


Some more fun statistics:
- on an average day our front httpd handles around 7 hits/s; more during EU peak, less during EU night (the normal EU profile). This is around 700k hits per day.
- during releases we mostly reach around 40 hits/s; in December 2009 (our xmas release, I believe?) we got values like 1.5M hits per day.
- over the last year we grew from around 5 hits/s to 7 hits/s, going from around 500k hits per day to 700k hits per day.
- in a month we handle around 20M hits, and have 0.5M ~ 1.0M unique visitors.
The last few weeks I have been doing a lot of work in the background to make our services faster. Over time MediaWiki has been using more and more CPU to render a page, and we have been getting more and more page hits. So I have been moving things around to separate them better; MediaWiki, for example, now runs on its own box. Days like these remind me why I do that. Of course normally you also get a much faster page out of it, but nobody really notices that. With 700k hits per day I can only hope a normal, sane system can handle it without people getting the feeling of 'lag'.
Today (well, yesterday; these stats are based on yesterday):

- an average of 46 hits/s, with more during EU peak, but also during US peak. Our normal traffic profile is totally whacked.
- because of the release on the 5th and the Yogscast, we had 3.5M hits yesterday.
- I measured peaks of 250 hits/s (5-minute window) and 80 hits/s (1-hour window).
- in the last 2 days we had 0.1M unique visitors. That would mean 1.5M a month if it continues (doubtful, but just for the sake of comparing values).
- this is higher than the Slashdot post a few months back! (of course we were also smaller back then, so it is comparing apples with oranges a bit ..)
On some parts of the site, requests are tenfold what they normally are. Then of course we have many people downloading all files from BaNaNaS (why?), others who are spidering our complete website (again: why?), but also many, many, many legit users.
Like I said above, for me this was a nice moment to start toying with a few things. It is hard to tune a service when you cannot produce a real load on it. Of course I often run DoS-like applications to fake a load, but nothing beats humans. Today showed exactly that. I mostly worked on our load balancer for binaries.
To serve you binaries, you connect to http://binaries.openttd.org/, and from your point of view you just get the file. In reality you get a redirect to one of our mirrors (which are kindly donated by people; see http://www.openttd.org/en/contact for who is in which country). When I built this load balancer, I used the following formula:
(The distance between each mirror and your geolocation as obtained via GeoIP) * (current 'load' on server), sorted by this value, take the lowest.
The 'load' of a server is just the number of GB flowing through it in a window. The window was kinda large, so our load was normally around 20 or so. I knew this formula was broken, mostly because it was possible for the US to get overloaded, which would then start to offload binaries to Europe. A very unwanted effect, but also needed to avoid too much traffic on one mirror.
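To make that concrete, here is a minimal sketch of that old formula in Python. The mirror names, coordinates and load values are made up for illustration; the real balancer of course gets the client position from GeoIP and the load from live traffic counters.

```python
import math

def distance_km(a, b):
    """Great-circle (haversine) distance between two (lat, lon) pairs, in km."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 6371 * 2 * math.asin(math.sqrt(h))

def pick_mirror_old(client_pos, mirrors):
    """Old formula: distance * load, take the lowest score."""
    return min(mirrors, key=lambda m: distance_km(client_pos, m["pos"]) * m["load"])

# Hypothetical mirrors: 'load' is GB served in the (large) window, so ~20 each.
mirrors = [
    {"name": "US", "pos": (40.7, -74.0), "load": 22},
    {"name": "GB", "pos": (51.5, -0.1), "load": 19},
    {"name": "NL", "pos": (52.4, 4.9), "load": 21},
]

print(pick_mirror_old((41.4, 2.2), mirrors)["name"])  # client in Spain -> GB
```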
Today I asked Zernebok (the host of this forum) if they would mind an increased load on their mirror, as we were suddenly getting a lot more US visitors. He was fine with it (tnx again!), so I fiddled with the balancer to balance better. I also noticed that GB (the country) had an unusual load, so I needed to look into that too. The latter turns out to be people from countries like Spain: the closest mirror we have for them is, funnily enough, GB. Network-wise this is not efficient, but geographically it is the shortest route. Funny how that goes.
I changed the formula to the following:
(the distance between each mirror and your geolocation as obtained via GeoIP) ** (current 'load' on server), sorted by this value, take the lowest.
Mind the **. This means: to the power of. So the values EXPLODE. Of course with a load of 20 this is unacceptable. So .. I decreased the load window by a lot, which keeps the value around 2. This gives me a much better result, and one I would expect.
People who connect from the US have a high chance of hitting the US mirror. But if it gets loaded a lot in a short window, they do get offloaded to Europe. Perfect!
Also, within Europe routing is better. People from Spain now balance between GB, NL and DE, depending on their load. It scales so much better.
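For comparison, here is the same sketch with the new scoring (reusing the distance_km helper and the made-up mirror data from above). With the shorter window the load hovers around 2, so the nearby mirror normally wins; but a short burst on it makes its score explode and pushes clients to the next mirror until the window cools down again.

```python
def pick_mirror_new(client_pos, mirrors):
    """New formula: distance ** load, take the lowest score."""
    return min(mirrors, key=lambda m: distance_km(client_pos, m["pos"]) ** m["load"])

philly = (40.0, -75.2)  # a hypothetical client on the US east coast

# Quiet moment: loads hover around 2, the nearby US mirror wins.
mirrors = [
    {"name": "US", "pos": (40.7, -74.0), "load": 2.1},
    {"name": "GB", "pos": (51.5, -0.1), "load": 2.0},
    {"name": "NL", "pos": (52.4, 4.9), "load": 2.0},
]
print(pick_mirror_new(philly, mirrors)["name"])  # -> US

# Short burst on the US mirror: its score explodes and the client is
# offloaded to Europe until the short window cools down again.
mirrors[0]["load"] = 4.0
print(pick_mirror_new(philly, mirrors)["name"])  # -> GB
```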

Of course this took a bit of trial and error, and sadly I managed to hog all connections for around 8 minutes.

My next point on the agenda, which is showing the first signs of stress marks, is our main website. Due to the old techniques used and a f***ed up deployment, it runs in a single thread and does not use any form of cache. An average page takes 10ms to generate, which tells you immediately that it cannot serve more than 100 hits per second. We are not close to that yet (I measured peaks of 50 hits per second; as you might understand, most of the hits go to our wiki, not our main page).
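If you want to see the back-of-the-envelope math behind that 100 hits per second, plus a purely hypothetical look at what more workers and a cache would buy (not how the site is actually deployed today):

```python
# One single-threaded, uncached worker: throughput is bounded by render time.
render_time_s = 0.010                 # ~10 ms to generate one page
max_rps_single = 1 / render_time_s    # ~100 requests/s ceiling

# Hypothetical improvements: more workers scale the ceiling roughly linearly,
# and cache hits that skip rendering raise it much further.
workers = 4                           # assumed number of worker processes
cache_hit_ratio = 0.8                 # assumed: 80% of pages served from cache
effective_time = render_time_s * (1 - cache_hit_ratio)
max_rps_improved = workers / effective_time   # ~2000 requests/s

print(max_rps_single, max_rps_improved)
```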

I would like to close with: BRING .. IT .. ON !!
PS: sorry for the wall of text; you have XeryusTC to thank for that!
PPS: for those who laugh at the (in their eyes) low hits per day, I only have to say: I pity the fool. (and yes, you have to do the voice when you read this).
PPPS: one of the most underappreciated jobs is being a SysOp. If it works, nobody complains. If it fails, everyone complains.
PPPPS: btw, the same holds for people who collect your garbage every week.