[Devblog] Server maintenance -- Finished
Posted: 05 Sep 2015 09:23
Hi guys,
Today I am writing about some SysOp work that will take place over the next month. For those who are interested, I will try to keep this post updated with how things are going.
So what is going on?
We have been running XenServer 6.2 for a while now, and with 6.5 out, we would like to upgrade. The main reason is that Jessie is not supported by 6.2 (PyGrub issues), and we really would like to upgrade all our VMs to Jessie over time. The upgrade will also allow us to use Docker, which would improve the CompileFarm greatly. Enough reasons to upgrade, basically.
Sadly, upgrading XenServer to a newer version is far from trivial, even more so if you run on a single server. When you have physical access to a machine it is possible, but without it, it is nearly impossible. To get over that hurdle, we decided to temporarily add a second machine to our cluster. That way we can group both servers, migrate everything from one to the other, upgrade the old one, and migrate everything back. Because OVH is awesome, getting another identical machine is very easy, so a plan was crafted shortly after.
The rundown of what will happen, which is basically the internal information we passed around, is here:
https://devs.openttd.org/~truebrain/upgrade-6.5.txt
As you can see, a lot of time goes into planning before we attempt anything like this.
Today we installed the second server, which means that from today we have a month to get everything set up (because, as you might have guessed, we only have the second server for a month). I will update this post with the progress and the status.
If at any point you notice an issue that might be related to this plan, please do give a shout. If you have any suggestions on how we can do this better, please do tell!
2015-09-05:
- Installed XenServer 6.5 on MachB
- Issue one: systems don't want to join the same pool:
-- OVH only delivers XenServer 6.5 templates
-- a 6.5 host cannot join a 6.2 pool (so MachB cannot join a pool from MachA)
-- a 6.2 host cannot join a 6.5 pool while having VMs running (MachA has VMs running, and that is exactly what I want/need)
-- Basically there is no way around this; falling back to export / import of VMs twice I guess ..
- Created GRE between both hosts, giving access to the same internal network on both hosts
-- Tested communication between VMs - all seems to be working; we now have 1 internal network over two hosts
- Upgrade 6.5 host to latest
- Initial tests migrating a single VM with a minimal amount of downtime
-- Snapshot the VM
-- Export snapshot, copy to MachB, import snapshot
--- Works fine, but not fileless (XenServer no longer supports vm-import from stdin, sadly enough); this means I need enough disk space to export/import each VM
-- Shutdown of VM on MachA
-- Mounting disk on dom0 on both machines, to rsync the remaining files after shutdown
-- Start of VM on MachB
-- Network and everything is working and functional, all files arrived safely on new machine, including those created after the snapshot
--- This means the downtime is only as long as the rsync takes to sync the final changes
-- Initial tests finished successfully. Will have to write a few scripts to speed up the rsync steps: it takes 9 commands per machine, and typos are easily made.
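For readers who want to see the tested procedure as one sequence, here is a rough Python sketch that only prints the commands in order. It is not the script used here. The xe subcommands are standard XenServer CLI calls, but the function name, the snapshot UUID placeholder, the "machb" host name, the file path and the mount point are all assumptions made up for illustration.

```python
def migration_plan(vm, snapshot_uuid="<snapshot-uuid>", target="machb",
                   image="/var/export/vm.xva", mount="/mnt/vdi"):
    """Return (host, command) pairs for one VM move; nothing is executed."""
    copy = f"{vm} (migrate)"  # hypothetical name label for the snapshot
    return [
        # Snapshot while the VM keeps running; xe prints the snapshot UUID.
        ("MachA", f'xe vm-snapshot vm="{vm}" new-name-label="{copy}"'),
        # Snapshots are stored as templates; clear the flag before exporting.
        ("MachA", f"xe template-param-set is-a-template=false uuid={snapshot_uuid}"),
        # Export to a file: no stdin import, so both hosts need the disk space.
        ("MachA", f"xe vm-export vm={snapshot_uuid} filename={image}"),
        ("MachA", f"scp {image} root@{target}:{image}"),
        ("MachB", f"xe vm-import filename={image}"),
        # Downtime starts here.
        ("MachA", f'xe vm-shutdown vm="{vm}"'),
        # Assumes the VM disk is mounted in dom0 at the same path on both hosts.
        ("MachA", f"rsync -aH --numeric-ids --delete {mount}/ root@{target}:{mount}/"),
        # Downtime ends when the imported copy boots on the new host.
        ("MachB", f'xe vm-start vm="{copy}"'),
    ]


for host, command in migration_plan("Test VM"):
    print(f"{host}: {command}")
```

Only the steps between vm-shutdown and vm-start count as downtime, which is why the final rsync is the part worth making fast.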
2015-09-06:
- Wrote the scripts to automate most of it
-- A script that mounts a VDI to dom0, and unmounts after pressing enter
-- A script that creates a snapshot and exports it to a remote system
-- An rsync command to do the final sync
- Testing the move with a test VM
- All preflight checks are green; guess tomorrow I will move the first real VM, and see what happens.
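The mount script mentioned above could look roughly like the following in Python (3.7 or newer). This is a sketch under assumptions, not the actual script: it assumes the VDI is attached to dom0 through a temporary VBD, that the filesystem sits on the first partition of the resulting device, and that /mnt/vdi exists. The names mounted_vdi and runner are hypothetical.

```python
from contextlib import ExitStack, contextmanager
import subprocess


def run(*args):
    """Run a command and return its trimmed stdout."""
    return subprocess.check_output(args, text=True).strip()


@contextmanager
def mounted_vdi(vdi_uuid, dom0_uuid, mountpoint="/mnt/vdi", runner=run):
    """Attach a VDI to dom0, mount it, and undo every step on exit."""
    with ExitStack() as stack:
        # Create a VBD that links the VDI to dom0; xe prints the VBD UUID.
        vbd = runner("xe", "vbd-create", f"vm-uuid={dom0_uuid}",
                     f"vdi-uuid={vdi_uuid}", "device=autodetect")
        stack.callback(runner, "xe", "vbd-destroy", f"uuid={vbd}")

        runner("xe", "vbd-plug", f"uuid={vbd}")
        stack.callback(runner, "xe", "vbd-unplug", f"uuid={vbd}")

        # Ask which block device the VBD became (for example "xvdb").
        device = runner("xe", "vbd-param-get", f"uuid={vbd}",
                        "param-name=device")
        # Assumption: the data lives on the first partition.
        runner("mount", f"/dev/{device}1", mountpoint)
        stack.callback(runner, "umount", mountpoint)

        # Cleanup runs in reverse order: umount, vbd-unplug, vbd-destroy.
        yield mountpoint


# Example use on a XenServer host (placeholders must be filled in):
#
#     with mounted_vdi("<vdi-uuid>", "<dom0-uuid>") as path:
#         input(f"{path} is mounted; press enter to unmount")
```

Registering each undo step right after its matching action means a failure halfway through still leaves dom0 clean, which matters when typos are the main risk.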
2015-09-07:
- First real VM move
-- Exported "OpenTTD - Devs" VM from MachA (50GB disk, 21GB snapshot on disk)
-- Imported "OpenTTD - Devs" VM on MachB
-- Shut down "OpenTTD - Devs" VM on MachA
-- Rsyncing latest changes
-- Booting "OpenTTD - Devs" VM on MachB
-- Up and running; all seems to be working correctly
2015-09-11:
- Transferring our biggest VM (the oldest of the bunch): 600GB of data. This will take a few hours.
2015-09-12:
- 300GB at 23 MB/s, took ~4 hours, but the VM arrived on the new machine
-- This machine is the hardest of all, as it has VMs inside (and uses LVM and XFS)
-- Importing the machine
-- Booting without network
-- Change primary IP address, remove auto-start of VMs inside
-- Reboot with network
-- Shut down first internal VM
-- Rsync latest changes
-- Start up on new machine
-- Repeat for the other 2 VMs
-- (This VM is one that should be shut down, but it is still a critical part of the infrastructure; as it goes with these kinds of gradual upgrades, there is always something that cannot "just" be migrated away)
- Migrating the VMs for the CF
-- Total of 5 VMs
-- They can be shut down and moved; a lot easier and faster
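As a sanity check on the big transfer at the top of this entry, the reported 300GB at 23 MB/s can be turned into hours with a few lines of Python. The function name is made up for illustration, and the sketch assumes a steady rate and decimal units (1 GB = 1000 MB).

```python
def transfer_hours(size_gb, rate_mb_per_s):
    """Hours needed to move size_gb at a steady rate (1 GB = 1000 MB)."""
    seconds = size_gb * 1000 / rate_mb_per_s
    return seconds / 3600


print(f"{transfer_hours(300, 23):.1f} hours")  # prints "3.6 hours"
```

That matches the "~4 hours" noted for the transfer.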
2015-09-13:
- Moving "OpenTTD - Content"
-- This is one of the bigger VMs (200+GB); contains all the binaries and stuff
- Moving "General - LDAP"
-- This handles all authentication
- Moving "General - MySQL"
- Moving "OpenTTD - Django"
-- Unused at the moment
- Moving "OpenTTD - Jira"
-- This is merely a test machine
- Moving "Proxy - Email"
-- This is our anti-spam email proxy
- Moving "OpenTTD - FlySpray"
- Moving "OpenTTD - MediaWiki"
- Moving "OpenTTD - VCS"
- Moving "Proxy - SSH"
- Moving "Proxy - Web"
2015-09-16:
- Moving "Gateway"
-- Using FailoverIP from OVH to migrate IP to new machine
-- Turns out it only works for IPv4; seems IPv6 will be more tricky
-- Seems OVH doesn't allow IPv6 routing without an IPv4 assigned to the VM (well, without a MAC, but you cannot get that without an IPv4); left 1 IP on the old machine (dedicated IP for email traffic)
-- While MachA is being reinstalled, IPv6 will be unreachable
2015-09-19:
- Shutting down "Gateway" on MachA
- Reinstalling MachA with XenServer 6.5
-- Via OVH CP, does all the magic for us
-- Because the mailserver will be unreachable, installing an SSH key so I can reach the server without the password
-- Install all the updates
-- Configuring new server (match the networks mainly)
-- Install GRE bridge for combined internal network
-- Join the other server in the pool
-- Make MachA master
- Moving "Gateway"
-- Copy "Gateway" to MachA
-- Mount it in dom0, and change configuration
-- Launch it; email and IPv6 connectivity restored
-- Moving IPs back to MachA
-- Moved every IP to its own interface; the firewall can now decide per IP what to allow, based on the interface
- Moving "OpenTTD - Django"
- Moving "OpenTTD - Jira"
- Moving all CF related VMs
- Moving "OpenTTD - Devs"
- Moving "OpenTTD - VCS"
- Moving "General - LDAP"
- Moving "Proxy - SSH"
- Moving "Proxy - Web"
2015-09-26:
- Moving "MySQL"
- Moving "Old Machine"
-- Can't do it the way I did the others; back to the old approach
-- Export (~3 hours later)
-- Import (~5 hours later)
-- Sync disk data
-- Start machine
2015-09-27:
- Moving "OpenTTD - Content"
-- Same as above
- Finished!