[Devblog] Server maintenance -- Finished
Hi guys,
Today I am writing to you about some SysOp work that will take place over the next month. For those who are interested, I will try to keep this post updated on how things are going.
So what is going on?
We have been running XenServer 6.2 for a while now, and with 6.5 out, we would like to upgrade. Mainly because Jessie is not supported by 6.2 (PyGrub issues), and we really would like to upgrade all our VMs to Jessie over time. This will also allow us to use Docker, which would improve the CompileFarm greatly. Enough reasons to upgrade, basically.
Sadly, upgrading XenServer to a newer version is far from trivial, even more so if you run on a single server. When you have real access to a machine it is possible, but without it, nearly impossible. So to resolve that hurdle, we decided to temporarily add a second machine to our cluster. That way we can group up both servers, migrate everything from one to the other, upgrade the old one, and migrate everything back. Because OVH is awesome, getting another identical machine is very easy, so a plan was crafted shortly after.
The rundown of what will happen, which is basically the internal information we passed around, is here:
https://devs.openttd.org/~truebrain/upgrade-6.5.txt
As you can see, a lot of time is put into planning ahead before we attempt anything like this.
Today we installed the second server, which means from today we have a month to get everything set up (because, as you might have guessed, we only have the second server for a month). I will update this post with the progress and status.
If at any point you notice any issue which might be related to this plan, please do give a shout. If you have any suggestions on how we can do this better, please do tell!
2015-09-05:
- Installed XenServer 6.5 on MachB
- Issue one: systems don't want to join the same pool:
-- OVH only delivers XenServer 6.5 templates
-- a 6.5 host cannot join a 6.2 pool (so MachB cannot join a pool from MachA)
-- a 6.2 host cannot join a 6.5 pool while having VMs running (MachA has VMs running, and that is exactly what I want/need)
-- Basically there is no way around this; falling back to export / import of VMs twice I guess ..
- Created GRE between both hosts, giving access to the same internal network on both hosts
-- Tested communication between VMs - all seem to be working; we now have 1 internal network over two hosts
- Upgraded the 6.5 host to the latest updates
- Initial tests migrating a single VM with minimal amount of downtime
-- Snapshot the VM
-- Export snapshot, copy to MachB, import snapshot
--- Works fine, but not fileless (XenServer no longer supports vm-import from stdin, sadly enough); this means I need enough disk space to export/import each VM
-- Shut down of VM on MachA
-- Mounting disk on dom0 on both machines, to rsync the remaining files after shutdown
-- Start of VM on MachB
-- Network and everything is working and functional, all files arrived safely on new machine, including those created after the snapshot
--- This means the downtime is only as long as the rsync takes to sync the final changes
-- Initial tests successfully finished. Will have to write a few scripts to speed up the rsync steps. It requires 9 commands per machine, and typos are easily made.
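The test sequence above can be laid out as an ordered plan, which makes the downtime window explicit. This is only a minimal sketch in Python and not the actual procedure or scripts: the `Step` type and the `downtime_window` helper are hypothetical names made up for illustration, and the step descriptions paraphrase the log.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    description: str
    vm_online: bool  # True while the VM on MachA is still serving users


# Order taken from the 2015-09-05 log entry; the names are illustrative only.
MIGRATION_STEPS = [
    Step("snapshot the VM on MachA", vm_online=True),
    Step("export the snapshot to a file on MachA", vm_online=True),
    Step("copy the export file to MachB", vm_online=True),
    Step("import the snapshot on MachB", vm_online=True),
    Step("shut down the VM on MachA", vm_online=False),
    Step("mount the VM disk in dom0 on both machines", vm_online=False),
    Step("rsync the files changed since the snapshot", vm_online=False),
    Step("start the VM on MachB", vm_online=False),
]


def downtime_window(steps):
    """Return the descriptions of the steps during which the VM is offline."""
    return [step.description for step in steps if not step.vm_online]


if __name__ == "__main__":
    for line in downtime_window(MIGRATION_STEPS):
        print("offline:", line)
```

The slow parts (export, copy, import) all happen while the VM is still running; only the last four steps count as downtime, and of those the rsync is the one whose duration depends on how much changed since the snapshot.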
2015-09-06:
- Wrote the scripts to automate most of it
-- A script that mounts a VDI to dom0, and unmounts after pressing enter
-- A script that creates a snapshot and exports it to a remote system
-- An rsync command to do the final rsync
- Testing the move with a test VM
- All preflight checks are green; guess tomorrow I will move the first real VM, and see what happens.
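For illustration, the final rsync step could be wrapped roughly like this. This is a sketch under assumptions, not the script that was actually written: the function name `final_rsync_command`, the mount point `/mnt/vdi` and the host name `machb.example` are all hypothetical, and the chosen rsync options (`-aHAX`, `--numeric-ids`, `--delete`) are one plausible set for copying a mounted VM disk between two dom0s rather than the ones used here.

```python
import shlex


def final_rsync_command(source_mount, target_host, target_mount, dry_run=False):
    """Build the rsync argv that copies changes made since the snapshot."""
    argv = [
        "rsync",
        "-aHAX",          # archive mode, plus hard links, ACLs and xattrs
        "--numeric-ids",  # keep uid/gid numbers; dom0 does not know the VM's users
        "--delete",       # drop files on the target that were deleted after the snapshot
    ]
    if dry_run:
        argv.append("--dry-run")
    # Trailing slashes: sync the contents of the mount point, not the directory itself.
    argv.append(source_mount.rstrip("/") + "/")
    argv.append(f"{target_host}:{target_mount.rstrip('/')}/")
    return argv


if __name__ == "__main__":
    # Hypothetical paths and host name, printed instead of executed.
    cmd = final_rsync_command("/mnt/vdi", "machb.example", "/mnt/vdi", dry_run=True)
    print(shlex.join(cmd))
```

Building the argument list in one place and printing it first is a cheap way to avoid the typos mentioned earlier; passing the list to `subprocess.run` would be the next step once the dry run looks right.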
2015-09-07:
- First real VM move
-- Exported "OpenTTD - Devs" VM from MachA (50GB disk, 21GB snapshot on disk)
-- Imported "OpenTTD - Devs" VM on MachB
-- Shut down "OpenTTD - Devs" VM on MachA
-- Rsyncing latest changes
-- Booting "OpenTTD - Devs" VM on MachB
-- Up&running; all seems to be working correctly
2015-09-11:
- Transferring our biggest VM (the oldest of the bunch): 600GB of data. This will take a few hours.
2015-09-12:
- 300GB at 23 MB/s, took ~4 hours, but the VM arrived on the new machine
-- This machine is the hardest of all, as it has VMs inside (and uses LVM and XFS)
-- Importing the machine
-- Booting without network
-- Change primary IP address, remove auto-start of VMs inside
-- Reboot with network
-- Shut down first internal VM
-- Rsync latest changes
-- Start up on new machine
-- Repeat for the other 2 VMs
-- (This VM is one that should be shut down, but it is still a critical part of the infrastructure; as usual with these kinds of gradual upgrades, there is always something that cannot "just" be migrated away)
- Migrating the VMs for the CF
-- Total of 5 VMs
-- They can be shut down and moved; a lot easier and faster
2015-09-13:
- Moving "OpenTTD - Content"
-- This is one of the bigger VMs (200+GB); contains all the binaries and stuff
- Moving "General - LDAP"
-- This handles all authentication
- Moving "General - MySQL"
- Moving "OpenTTD - Django"
-- Unused at the moment
- Moving "OpenTTD - Jira"
-- This is merely a test machine
- Moving "Proxy - Email"
-- This is our anti-spam email proxy
- Moving "OpenTTD - FlySpray"
- Moving "OpenTTD - MediaWiki"
- Moving "OpenTTD - VCS"
- Moving "Proxy - SSH"
- Moving "Proxy - Web"
2015-09-16:
- Moving "Gateway"
-- Using FailoverIP from OVH to migrate IP to new machine
-- Turns out it only works for IPv4; seems IPv6 will be more tricky
-- Seems OVH doesn't allow IPv6 routing without an IPv4 assigned to the VM (well, a MAC address, but that you cannot get without an IPv4); left 1 IP on the old machine (dedicated IP for email traffic)
-- When MachA is being reinstalled, IPv6 will be unreachable
2015-09-19:
- Shutting down "Gateway" on MachA
- Reinstalling MachA with XenServer 6.5
-- Via OVH CP, does all the magic for us
-- Because the mailserver will be unreachable, installing an SSH key so I can reach the server without the password
-- Install all the updates
-- Configuring new server (match the networks mainly)
-- Install GRE bridge for combined internal network
-- Join the other server in the pool
-- Make MachA master
- Moving "Gateway"
-- Copy "Gateway" to MachA
-- Mount it in dom0, and change configuration
-- Launch it; email and IPv6 connectivity restored
-- Moving IPs back to MachA
-- Moved every IP onto its own interface; the firewall can now specify per IP what to allow, based on the interface
- Moving "OpenTTD - Django"
- Moving "OpenTTD - Jira"
- Moving all CF related VMs
- Moving "OpenTTD - Devs"
- Moving "OpenTTD - VCS"
- Moving "General - LDAP"
- Moving "Proxy - SSH"
- Moving "Proxy - Web"
2015-09-26:
- Moving "MySQL"
- Moving "Old Machine"
-- Can't do it the way I did the others; back to the old approach
-- Export (~3 hours later)
-- Import (~5 hours later)
-- Sync disk data
-- Start machine
2015-09-27:
- Moving "OpenTTD - Content"
-- Same as above
- Finished!
The only thing necessary for the triumph of evil is for good men to do nothing.
Re: [Devblog] Server maintenance
I really like the updates like that.
And I somewhat know what it's like to undertake such a major upgrade.
So, thanks and good luck!
Re: [Devblog] Server maintenance
Tnx for the mental support
These kinds of things always sound so easily done, but they are far from it. Even more so because I want to do these things with a minimal amount of downtime and always make sure the data is secure. Otherwise it would be much more trivial tbh
Re: [Devblog] Server maintenance
Today attempting to move the first real VM; one of somewhat lesser importance (the one with all the developer homedirs). Let's see what happens
When it is done (success or not), I will continue this weekend. Let's hope for the best today
Re: [Devblog] Server maintenance
Today I will be moving the hardest VM of all (one that has VMs inside it again); the main website, musa, ottd_content and some other minor related services will experience a short downtime while I rsync the latest changes to the temporary machine.
If that succeeds, I will move the rest of the VMs over the weekend; expect random downtime of services like wiki, bugs, vcs, balancer, MySQL, LDAP, CompileFarm, email, ssh, web. Nothing more than 5 minutes per service mentioned here, so most likely unnoticed by you, the reader
Re: [Devblog] Server maintenance
All VMs are moved to MachB, except for the Gateway; will give it a few hours to see if any errors are being reported. If not, I will migrate the Gateway too, using OVH's failover IPs, and we will find out how solid that functionality is
Re: [Devblog] Server maintenance
Seems I hit a little snag; I cannot temporarily move our current IPv6 range to the new machine. I guess I will leave the gateway active on the old machine for the time being, to make sure IPv6 is routed correctly. Over the weekend the IPv6 connection will be lost for a few hours. Very sorry for those who are IPv6 only .. I don't really see an alternative at this point.
Also had to leave behind a single IPv4 (out of the 3), as without IPv4, IPv6 isn't being routed
- Digitalfox
- Chief Executive
- Posts: 709
- Joined: 28 Oct 2004 04:42
- Location: Catch the Fox if you can... Almost 20 years and counting!
Re: [Devblog] Server maintenance
Great job, very cool diary
Re: [Devblog] Server maintenance
Tnx Digitalfox
Well, today is the day; going to format MachA. Always feels scary.. formatting machines with data on them. Do I have everything? Didn't I put a file at a strange place that is important? Etc. Well .. we will find out
Re: [Devblog] Server maintenance
Moved many VMs back to their rightful place. Only a few remain, which are more noticeable for the end-user. Downtime for these boxes will also be a bit longer than for the ones I did, so I will have to find a nice moment to move them. So far I have moved the VMs by shutting them down and moving them. Because both servers are now in the same pool, a VM is moved in ~5 minutes, which is about the same amount of downtime as we had with making a snapshot, moving, and rsyncing.
Also have to move the old old machine of 600GB. If I would do that with a shutdown + move, it would take ~8 hours. It runs the main website; so that is not acceptable. Guess I will use the same way as I moved it to this machine, by making a snapshot, moving that, and rsyncing in the end. Takes a bit more effort, but that will get the job done. Not for today; possibly for tomorrow, otherwise next weekend
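As a rough cross-check of that ~8 hour figure: the log above reports 300GB moved at 23 MB/s in about 4 hours. A small sketch of the arithmetic follows, assuming (and it is only an assumption) that a shutdown + move of the full 600GB would run at the same 23 MB/s; the function name `transfer_hours` is made up for this example.

```python
def transfer_hours(size_gb, rate_mb_per_s):
    """Hours needed to move size_gb gigabytes at rate_mb_per_s megabytes/second.

    Uses 1 GB = 1000 MB; with 1 GB = 1024 MB the results are about 2% higher.
    """
    seconds = size_gb * 1000 / rate_mb_per_s
    return seconds / 3600


if __name__ == "__main__":
    print(round(transfer_hours(300, 23), 1))  # 3.6, matches the "~4 hours" in the log
    print(round(transfer_hours(600, 23), 1))  # 7.2, in line with the "~8 hours" estimate
```

So at that rate, the whole disk would keep the main website offline for most of a working day, which is why the snapshot + rsync route is worth the extra effort for this machine.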
Re: [Devblog] Server maintenance
I'm curious, so what's the total GB on everything?
Re: [Devblog] Server maintenance
On disk, we use 1.6TB of data. That is not really a fair number, because one machine uses 600GB. This is a very old machine which once ran all of openttd.org. Of that 600GB, only 100GB is effectively used these days.
The total amount actually used is around 30%. For example, the binaries archive is just 55GB in size (but it is stored in 3 places for no reason whatsoever). MySQL is 10GB. Wiki 5GB, FlySpray 5GB.
So the on disk space is a lot bigger than the real used bytes. And I really have to clean up the old old machine ... 600GB disks are annoying
Re: [Devblog] Server maintenance
Weekend #4 .. and hopefully the last. I have 3 more machines to move, both critical and huge:
MySQL: when pushed offline, many services fail .. so I have to be careful and quick
Old Machine: 600GB of data .. ugh
OpenTTD - Content: contains all the public files .. lot of data .. ugh
I have had rather enough of this movement, I have to admit. 4 weekends of moving data one way, then the other, is starting to get to me. Well, it should be the last weekend, then we can start with the fun part: upgrading to Jessie
Hehe
Owh well, expect some downtime of some services this evening. It should be ~5 minutes each, so not too terrible
Re: [Devblog] Server maintenance
Right. Finished with everything. All VMs are back on their original server, only this time on XenServer 6.5. Time to prepare the Jessie upgrade
Re: [Devblog] Server maintenance -- Finished
Thanks for keeping us updated TrueBrain!