OpenTTD.org downtime

Posted: 03 Dec 2010 19:00
by TrueBrain
(all times are in CET)

At the moment, the server hosting openttd.org and most of its related webservices is offline.

A brief history of events:

0700 - Server goes down
1130 - I wake up (hey, it was a busy night, don't judge me!), and get told about the outage
1150 - I inform LeaseWeb, our provider, that our server is down, and that a remote reboot doesn't help
1155 - I verbally confirm my identity, and the issue is sent to the engineers
1600 - I call in to check on progress. No engineer has been available, and the ticket has not been handled
1800 - With the weekend coming, I send them another email. No reply so far
2000 - We are starting to become a bit desperate. The #openttdcoop guys have offered to host a temporary placeholder for us, which you can now enjoy. The masterserver is currently being hosted by Rubidium himself. The contentserver will remain offline, as we have few alternatives for that. Let's hope someone picks up our ticket soon and pushes the 'on' button. Let's hope it really is that simple, and no real hardware damage happened
0100 - Sent them another mail. I would at least have expected a reply with an estimate of when someone will be attending to the ticket. Maybe the reply got lost. SMTP is not a 100% delivery protocol after all
0430 - It's alive. ALIVE. LeaseWeb emails me to let me know someone went down to the machine and hit the power button. Now wtf went wrong?
0930 - Rubidium wakes up. Boots all services that didn't autoboot for one reason or the other
1130 - I wake up (hey! Sue me). Investigating wtf happened ...
1145 - LeaseWeb logs in to our server to investigate with us
1215 - LeaseWeb concludes the same as we did: nothing seems wrong. Strange

Things you might want to know:

Two weeks ago, we had our first outage in 2 years. This is quite impressive: 99.99% uptime for 2 whole years. I am pretty proud of that. As it goes with most outages, you reboot your server, and you continue.
A few days later, bam. Server down again. This time, a remote reboot didn't help. It turned out that our server was shut down, and rebooting doesn't help with that. Nothing in the logs suggests any of this, and we are clueless as to what happened. But, as it goes with computers, it happens. We checked the hardware as far as we can (HDDs, memory, ...). Nothing pops up. So .. we hope it is a one-time event, and continue.
Today, bam. Server down again. What the f*** .. this sounds more serious. Normally a NIC still replies to pings when a kernel panic happens. Nothing. Hmm. We should start to worry I guess. Therefore I also asked LeaseWeb, our provider, to look into this. Maybe there are hardware failures? I don't know.
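
To put the 99.99% figure in perspective, here is a rough sketch of the arithmetic in Python. The function names are hypothetical, and a year is assumed to be 365 days; the 22-hour value is the total downtime reported in the updates to this post.

```python
def downtime_budget_minutes(uptime_pct: float, days: float) -> float:
    """Minutes of downtime allowed by an uptime percentage over a period of days."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - uptime_pct / 100)


def achieved_uptime_pct(downtime_minutes: float, days: float) -> float:
    """Uptime percentage achieved given the minutes of downtime in a period of days."""
    total_minutes = days * 24 * 60
    return 100 * (1 - downtime_minutes / total_minutes)


two_years = 2 * 365  # assumption: 365-day years

# 99.99% over two years leaves roughly 105 minutes of allowed downtime.
print(round(downtime_budget_minutes(99.99, two_years), 1))

# A single 22-hour outage in a two-year window works out to about 99.87%.
print(round(achieved_uptime_pct(22 * 60, two_years), 2))
```

In other words, 99.99% over two years allows well under two hours of downtime in total, so one long outage is enough to move the number.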

After 7 hours, no engineer has looked at our server yet. Most likely they are really busy. Sadly enough, this means that most webservices are not working. With the weekend coming, we can only hope they look at it in the next few hours. I will at least try to keep you up-to-date. For now, you will have to make do with a lesser openttd.org. Hope you survive.

"Wait!", you say, "why don't you have mirrors?!". Although we have our binaries mirrored, mirroring a highly dynamic website is not easy. Also, with 2 years of 99.99% uptime, there has never been a reason to. Things like the masterserver and contentserver are also really hard to mirror, so they never have been. Of course this is a wake-up call for us to start looking at options here.

"Wait!", you say, "but are the OpenTTD binaries and source code safe?". Yes. All binaries are mirrored. All SVN data is also mirrored (to secret places :D), to more than one place. So OpenTTD itself as a project is safe. Always. We made damn sure of that when we lost our SVN 5 years ago ;)

I hope you can bear with us. At least I wanted to let you guys know that we know of the problems, but that it is at the moment out of our control. I hope I can give you a more positive update very soon. Bear with us.

[Update 20:10] Temporary website is in the air: http://fallback.openttd.org/
[Update 11:30] All services are restored back to normal. Total downtime: 22 hours.
[Update 12:30] LeaseWeb concludes the same as we did: nothing seems wrong with the hardware, and it was not a clean shutdown. How the machine then ever got powered off completely .. well, we close the investigation for now, in the hope this doesn't happen AGAIN.
[Update 12:40] LeaseWeb says they are very sorry it took this long to fix the problem. It has been incredibly busy there. Happens, as with everything. Happens. At least it is fixed now :D

Re: OpenTTD.org downtime

Posted: 03 Dec 2010 19:08
by Arie-
Anyone who wants to download something OpenTTD related, you'll have to browse the directories yourself, but still it works: ftp://ftp.snt.utwente.nl/pub/games/openttd/binaries/

Re: OpenTTD.org downtime

Posted: 03 Dec 2010 20:21
by Kogut
TrueBrain wrote:"Wait!", you say, "but are the OpenTTD binaries and source code safe?". Yes. All binaries are mirrored. All SVN data is also mirrored (to secret places :D), to more than one place. So OpenTTD itself as a project is safe. Always. We made damn sure of that when we lost our SVN 5 years ago ;)
What about the bugtracker?

Re: OpenTTD.org downtime

Posted: 03 Dec 2010 20:47
by TrueBrain
Kogut wrote:
TrueBrain wrote:"Wait!", you say, "but are the OpenTTD binaries and source code safe?". Yes. All binaries are mirrored. All SVN data is also mirrored (to secret places :D), to more than one place. So OpenTTD itself as a project is safe. Always. We made damn sure of that when we lost our SVN 5 years ago ;)
What about the bugtracker?
Bugtracker, wiki and the others do not have backups. They are not primary to the project's survival. I guess the wiki can be considered primary, and I guess I will start backups for it soonish. No clue why I never did that before tbh .. :)

But no worries, the server will be back online :) And as all data is in RAID-Mirror, a lot has to go wrong for it to be lost :) My remark was only meant for those who have been here long enough to still remember the last 'more than 6 hour unannounced outage', which resulted in days without server, and no SVN in the end :p This will not happen :)

Re: OpenTTD.org downtime

Posted: 03 Dec 2010 21:22
by Alberth
TrueBrain wrote:which resulted in days without server, and no SVN in the end :p This will not happen :)
Even if all your backups fail, there are a lot of people that have a pretty much complete hg or git clone.

Re: OpenTTD.org downtime

Posted: 03 Dec 2010 22:42
by orudge
Arie- wrote:Anyone who wants to download something OpenTTD related, you'll have to browse the directories yourself, but still it works: ftp://ftp.snt.utwente.nl/pub/games/openttd/binaries/
There are also mirrors available at:

http://us.binaries.openttd.org/binaries/
http://gb.binaries.openttd.org/binaries/
http://hu.binaries.openttd.org/binaries/

I think those are all, but I may have forgotten others!

Re: OpenTTD.org downtime

Posted: 03 Dec 2010 23:29
by SHADOW-XIII
TrueBrain wrote:0700 - Server goes down
1130 - I wake up (hey, it was a busy night, don't judge me!), and get told about the outage
Have you considered using Nagios? I have Nagios sending me an email/text* when the server is not accessible for some time (20 min, I think) from another server,
and I configured my phone to keep ringing like crazy if I get the message (even at night) :) ... although the hours can be adjusted freely, in my case the website has to be up 24h/day

This time it wouldn't have helped, but generally speaking, you could have Nagios monitor not just the server as a whole but single processes as well, in case just apache/sql dies

*using premium service, mail to text
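
The kind of check described above (alert only once the server has been unreachable from another machine for about 20 minutes) can be sketched in a few lines of Python. This is not Nagios or its configuration, just a hypothetical stand-in: the function names, the 5-minute polling interval and the TCP port are assumptions made for illustration.

```python
import socket
from typing import Iterable


def is_reachable(host: str, port: int = 80, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def should_alert(history: Iterable[bool], interval_min: int = 5, grace_min: int = 20) -> bool:
    """Alert once every check in the last grace_min minutes has failed.

    history holds one result per poll, oldest first (True means reachable).
    """
    needed = -(-grace_min // interval_min)  # ceiling division: polls covering the grace period
    recent = list(history)[-needed:]
    return len(recent) == needed and not any(recent)
```

A scheduler (cron, for example) would call is_reachable every interval_min minutes from a second machine, append the result to the history, and send the mail or text when should_alert returns True. Waiting out the grace period avoids a wake-up call for every short network blip.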

Re: OpenTTD.org downtime

Posted: 03 Dec 2010 23:31
by Lord Aro
So basically, we (or you ;) ) are in fairly deep s***, but not quite as deep as 5 years ago. Correct?

Also, when did the downtime actually occur? The first post states 7:00 CET, but I noticed it before 7:00 GMT and reported it on IRC at around 7:10, as SmatZ will testify ;)

Re: OpenTTD.org downtime

Posted: 03 Dec 2010 23:42
by TrueBrain
SHADOW-XIII wrote:
TrueBrain wrote:0700 - Server goes down
1130 - I wake up (hey, it was a busy night, don't judge me!), and get told about the outage
Have you considered using Nagios? I have Nagios sending me an email/text* when the server is not accessible for some time (20 min, I think) from another server,
and I configured my phone to keep ringing like crazy if I get the message (even at night) :) ... although the hours can be adjusted freely, in my case the website has to be up 24h/day

This time it wouldn't have helped, but generally speaking, you could have Nagios monitor not just the server as a whole but single processes as well, in case just apache/sql dies

*using premium service, mail to text
I do this for free, in my free time, because I like it. Hell no, I am not going to let it interrupt my sleep :D Sorry ....

I already have those interruptions in my profession ;) (I work for an ISP :p Not LeaseWeb, mind you ;))

Re: OpenTTD.org downtime

Posted: 03 Dec 2010 23:43
by TrueBrain
Lord Aro wrote:So basically, we (or you ;) ) are in fairly deep s***, but not quite as deep as 5 years ago. Correct?

Also, when did the downtime actually occur? The first post states 7:00 CET, but I noticed it before 7:00 GMT and reported it on IRC at around 7:10, as SmatZ will testify ;)
Nah, we are not in deep s*** in any way. We just have to wait for an engineer to pick up our ticket, walk to the machine, and figure out what is wrong. I was just trying to be funny, which clearly failed ;)

7:00 CET is 6:00 GMT, so I guess that is possible. I don't do exact time. I don't care :p It was early morning .. do we really need to nitpick about the exact time it happened, and under which conditions? I can tell you it was snowing outside here at the time it happened :p Ghehe :) No, but seriously dude, nobody cares if it was 0600, 0700, 0800, or any time in between :) It went down. That matters ;)
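
The CET to GMT conversion is easy to get backwards, so here is a small sketch of it in Python. It assumes fixed offsets (CET = UTC+1, GMT = UTC+0), which holds in December when no daylight saving time applies; the helper name is made up for this example.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed offsets, valid in winter (no daylight saving time).
CET = timezone(timedelta(hours=1), "CET")
GMT = timezone.utc


def cet_to_gmt(year: int, month: int, day: int, hour: int, minute: int = 0) -> datetime:
    """Convert a CET wall-clock time to the same instant expressed in GMT."""
    local = datetime(year, month, day, hour, minute, tzinfo=CET)
    return local.astimezone(GMT)


outage_start = cet_to_gmt(2010, 12, 3, 7)
print(outage_start.strftime("%H:%M"))  # prints 06:00
```

CET is ahead of GMT, so a CET time always maps to an earlier GMT time.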

Re: OpenTTD.org downtime

Posted: 04 Dec 2010 00:03
by Lord Aro
Ok, that makes sense. I was thinking that 7:00 CET was 8:00 GMT
But, as you say, that doesn't matter...
Now all that needs to be done is to get an engineer playing OTTD...fixed in no time! :D (yes I do realise that isn't how it works)


N.B. I was also trying to be funny, but that obviously didn't get through either ;)

Re: OpenTTD.org downtime

Posted: 04 Dec 2010 00:07
by TrueBrain
Humor on the internet is a tricky thing ;)

And yeah ... we now need someone who works at LeaseWeb who says: "WHAT? 12 hours and no reply? That can't happen to OpenTTD!" And fixes it :D

Re: OpenTTD.org downtime

Posted: 04 Dec 2010 00:25
by kamnet
12 hours and no reply? Sheesh, I thought my hosting provider was bad. :-/

Re: OpenTTD.org downtime

Posted: 04 Dec 2010 01:31
by XeryusTC
Well, you can't really complain about it that much because AFAIUI hosting is basically free. Even so, it is quite nasty that they let such a thing go on for such a long while without a proper response. I guess that there are bigger things going on at LeaseWeb than just the OTTD server ;)

Re: OpenTTD.org downtime

Posted: 04 Dec 2010 04:40
by orudge
XeryusTC wrote:Well, you can't really complain about it that much because AFAIUI hosting is basically free.
It's far from free, hence why OpenTTD has had fundraising drives in the past. While LeaseWeb do sponsor the hosting, we still pay a good chunk for it. And, even if they did nothing else, one would hope they would be able to respond to a simple server restart request within a reasonable timeframe.

Re: OpenTTD.org downtime

Posted: 04 Dec 2010 05:03
by kamnet
XeryusTC wrote:Well, you can't really complain about it that much because AFAIUI hosting is basically free. Even so, it is quite nasty that they let such a thing go on for such a long while without a proper response. I guess that there are bigger things going on at LeaseWeb than just the OTTD server ;)
I disagree, you can most certainly complain when you're not getting responses. Even a simple "we're up to our necks in thicknet and fighting off battle bots" would be a nice acknowledgment.

Re: OpenTTD.org downtime

Posted: 04 Dec 2010 08:59
by ChillCore
TrueBrain wrote: ... 99.99% uptime for 2 whole years ...
While yesterday's events were a little annoying, that is a pretty stable service.


Anyway, the server is back. YAY. :)

Edit: spelling

Re: OpenTTD.org downtime

Posted: 04 Dec 2010 10:29
by TrueBrain
All services are restored as of around 0930. Investigation is still under way.

Re: OpenTTD.org downtime

Posted: 04 Dec 2010 11:15
by Lord Aro
W00p w00p! :mrgreen:
Now we wait patiently to see what the problem was...
Could it be something as simple as some engineer hitting the power switch when doing something else nearby?

Re: OpenTTD.org downtime

Posted: 04 Dec 2010 11:42
by TrueBrain
Final Update:

Both we and LeaseWeb have no clue what happened. As far as can be told without rebooting the machine, everything is how it should be. If it happens again, they will shut down the machine completely, and take a better look at it. For now we are just happy it is back online again ;)

Software diagnostics only take you so far. Let's hope it was a cosmic ray that hit our server.

Either way, I would like to thank LeaseWeb for their time to look at the problem too.