If you tried to visit Lamav.com after 2:07am GMT+10 on April 1, you probably saw a very unflattering white screen with some black text that said “fatal error” followed by a bunch of garble (well, PHP code, but it’s sort of the same thing).
(that “Uh Oh” is my cue to sleep 4 hours out of the next 48 – oh boy! So I lived at my desk for about two days. Aren’t you glad I included a picture?)
Part 1: Best Practice
It all started innocently enough. Running an eCommerce website that sells all over the world takes daily input. Things change fast in the tech world, especially online. That means that although I only built this website five months ago, it still requires constant updates as things change – and they do change. Daily in fact. So, the cycle of updates never ends – but sometimes a major update comes along for a core aspect of the site (in this case, the part that powers our online cart) that could cause things to go haywire (I’m from Texas!) if you’re not careful.
So I decided to be careful. I got with our host, Siteground, to build what’s called a staging copy of our website. A staging copy is an identical copy of the live site, but it’s not live – only I and the people I grant access to can see it. It’s a ‘sandbox’ – a place to test things before applying them to the live site. The idea is that you can make the staging copy go haywire without any worries, because the main site won’t be affected.
Yes, I know it’s “Intricacies”. I was too engrossed in the details of technical documentation to check my spelling. Obviously.
But let’s back up a little here. We were paying Siteground for what’s called a “Managed” VPS, along with daily complete backups of our website. “Managed” means that basic web server software like Apache, MySQL, and cPanel is installed and configured for us. VPS stands for ‘Virtual Private Server’, which means it behaves like a dedicated machine, but isn’t an actual physical server.
Let me back up a little more. Network uptime is the term used in the IT world for the amount of time a network is functional vs non-functional, expressed as a percentage. For example, let’s say I build a network that I want to run for 100 days. Here’s my network:
RadicallyFastNeutrinoTron2001, now with 199 Exobytes of LMNOPRAM
RadicallyFastNeutrinoTron2001, with 199 Exobytes of LMNOPRAM. Now with fire.
My network above has a 99% uptime: for 99 days it operated correctly, but one day it…did not. Network uptime in this day and age is usually 99.x% – typically 99.3 – 99.9% or so. Extended downtimes are *VERY* bad for business (I used to be a network administrator for a company that would lose approximately $50,000 per minute when their networks went down), not to mention quite embarrassing for the guy in charge (where is that guy anyway…oh wait, that’d be me). I’m making this point to illustrate how unacceptable downtime really is, and also how unusual it is.
When there’s downtime, Network Administrators (or Online Marketing Directors, or whatever Series of Capitalizations the guy/girl who gets yelled at when things go down has in front of his/her name) don’t sleep until the system is back online. Knowing this, the teams whose job it is to provide said system (in this case, Siteground) typically do everything they can to ensure everything gets fixed as fast as possible. If only I had been so lucky.
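For the arithmetic-inclined, here is a minimal Python sketch of the uptime math above. The function names are my own, and the 99.9% target over a 30-day month is just an assumed example, not a figure from any host's contract.

```python
def uptime_percent(total_days: float, down_days: float) -> float:
    """Share of the period the network was up, as a percentage."""
    if total_days <= 0:
        raise ValueError("total_days must be positive")
    if not 0 <= down_days <= total_days:
        raise ValueError("down_days must be between 0 and total_days")
    return 100.0 * (total_days - down_days) / total_days


def allowed_downtime_minutes(target_percent: float, period_days: float) -> float:
    """Minutes of downtime that a target uptime leaves room for in the period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (100.0 - target_percent) / 100.0


# The 100-day network from the example: one bad day out of 100.
print(uptime_percent(100, 1))  # 99.0

# Assumed example: a 99.9% target over a 30-day month.
print(round(allowed_downtime_minutes(99.9, 30), 1))  # 43.2
```

In other words, a 99.9% target leaves well under an hour of downtime per month, which is why a multi-day outage stands out so badly.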
Part 2: Staging, sans Curtain
One of the reasons I chose Siteground as our host was their nifty self-managed staging tool. This allows me to create staging environments with just a few clicks, without needing to contact them. Little did I know, Siteground’s staging feature has a flaw: when accessed via HTTPS (securely!), the staging environment writes to the live site database. But that’s not the worst of it. The worst of it is that (as near as I can tell without root access, which I do not have) it somehow integrates the new staging database into the website as well, meaning simply uploading a good database from backup can no longer solve the problem.
What does this mean? It basically means that I can’t use one of my site backups to restore the live site by uploading a database that hasn’t been corrupted by Siteground’s staging area error. Thus, I cannot recover the website by myself, and require their server admins to identify the issue, fix the issue, and then load a known-good backup of the server environment. But we’re okay, right, because we’re a paying customer who pays for daily server backups just in case the worst happens?! Apparently not.
Pretty much a verbatim quote of our conversation. I wish I was kidding.
Part 3: Houston, We Have a Problem
At this point, I’m pretty much appalled, and I’m not easy to appall. Siteground has just strung me along for hours before finally telling me that they can’t fix the issue and that I’ll have to wait 24 hours for a more senior person to “have the time to investigate the issue”. In the online world, 24 hours is equivalent to years in any other industry. I’ve basically been told “we don’t care, get lost”.
Thankfully, I’m paranoid, and take my own backups in addition to the backups that we pay Siteground to take and store. Since the flaw in Siteground’s tool is so severe, merely uploading these backups to their servers did not solve the problem, due to the flaw enmeshing the site databases. So because Siteground isn’t able to or is refusing to or doesn’t know how to (take your pick, I never got an answer either) restore a VPS backup, I have to upload the site somewhere else.
Part 4: The Scramble
By this point the site has already been down for over 8 hours. I’ve been awake all night attempting to explain the issue and get a resolution from Siteground, but as I stated above, that isn’t happening. So now I’m forced to do something I really didn’t want to have to do: switch hosts.
It’s not that I wanted to stay with Siteground (I didn’t) but that I’d much rather have a functional site while making the switch to a new host than not. That’s due to something called “DNS propagation”. See the high-tech schematic from a textbook I took the time to scan in below:
© Intertron Universities. Please don’t sue me.
The internet functions via a set of instructions, called a protocol. That protocol is called TCP/IP (Transmission Control Protocol/Internet Protocol). Part of those instructions say that each and every device connected to the network (that we call the internet) must have a unique address, so that the right information gets to the right device. This is called an IP address (you guessed it – Internet Protocol address) and they look something like this: 192.168.1.101.
Whenever you type in a URL (short for Uniform Resource Locator, whee, fun with acronyms!) the domain name in that URL is translated to an IP address. To make it even more interesting, the IP address associated with the domain name can be changed. For instance, lamav.com might point to 22.214.171.124 today (which, let’s say was the IP address of the server at Siteground where our site was located – it’s not, but you get the idea).
However, as you may recall, our server at Siteground looks sort of like the earlier picture of my RadicallyFastNeutrinoTron2001 network with fire. So, I need lamav.com (in addition to all of our email addresses and other things) to point to the new server IP address.
So how does the internet “know” that lamav.com used to point to the IP address of Siteground, but now points to the IP address of our new host (which is Liquid Web, btw)? By using something called the Domain Name System, or DNS for short (even more acronyms! They make me feel smart *flex*). DNS is a huge distributed database that correlates domain names with IP addresses (physical devices, for the most part). At the top of it sits a small set of “Master” DNS servers (the root servers), most of which (at least, to the best of my knowledge) are run by organizations based in the USA (that’s because we invented the internet. so there.).
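If you want to watch a name turn into an address yourself, here is a minimal Python sketch that asks your computer's resolver for an IPv4 address. The helper name `resolve_ipv4` is my own. The demo uses "localhost" so it runs without a network connection. Swap in a real domain such as "lamav.com" to query live DNS, keeping in mind the answer depends on when and where you run it.

```python
import socket


def resolve_ipv4(hostname: str) -> str:
    """Ask the system resolver for one IPv4 address for the given name."""
    return socket.gethostbyname(hostname)


# "localhost" normally resolves locally, with no DNS traffic at all.
# Replace it with a real domain to see what your resolver currently returns.
print(resolve_ipv4("localhost"))
```

The same call can return a different address after a host switch, once the DNS records have been changed and the old answer has aged out of the caches along the way.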
The “how” of making the switch is fairly straightforward for us network admin types. It’s the DNS propagation delay that is the catch. “Propagation Delay” (dang, no acronym for that) means the time it takes for the change I made to where lamav.com points to reach every DNS server that still has the old answer on file. First the records at the top of the DNS hierarchy have to be updated, then the change trickles down to all of the lower DNS servers as their saved copies of the old record expire, eventually reaching the DNS server at your ISP (Internet Service Provider, there I made up for the earlier lack of acronyms) and finally updating the DNS cache that sits on your very own computer. Unfortunately, this takes time. In the USA, it takes 30 minutes to 12 hours, but typically 1-2 hours tops. In Australia it can take 12-72 hours (often at least 12 hours, 18 being the average that I’ve seen).
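One common way to picture propagation delay is as caching: each DNS server along the way remembers an answer for a set time (its TTL, or "time to live") and only asks again once that time runs out. The toy Python model below is a sketch under that assumption, not a real DNS implementation. The class names, the 4-hour TTL, and both IP addresses (taken from ranges reserved for documentation) are made up for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class CachedAnswer:
    ip: str
    expires_at: float  # simulated clock time, in seconds


class CachingResolver:
    """Toy resolver that reuses an answer until its TTL runs out."""

    def __init__(self, authoritative: dict[str, str], ttl_seconds: float):
        self.authoritative = authoritative  # stand-in for the "source of truth"
        self.ttl_seconds = ttl_seconds
        self.cache: dict[str, CachedAnswer] = {}

    def lookup(self, name: str, now: float) -> str:
        cached = self.cache.get(name)
        if cached is not None and now < cached.expires_at:
            return cached.ip  # still fresh as far as this resolver knows
        ip = self.authoritative[name]  # cache miss or expired: ask upstream
        self.cache[name] = CachedAnswer(ip, now + self.ttl_seconds)
        return ip


records = {"lamav.com": "203.0.113.10"}  # made-up "old host" address
isp = CachingResolver(records, ttl_seconds=4 * 3600)  # hypothetical 4-hour TTL

print(isp.lookup("lamav.com", now=0))         # 203.0.113.10 (gets cached)
records["lamav.com"] = "198.51.100.20"        # made-up "new host" address
print(isp.lookup("lamav.com", now=3600))      # 203.0.113.10 (cache still fresh)
print(isp.lookup("lamav.com", now=5 * 3600))  # 198.51.100.20 (cache expired)
```

Stack a few of these caches on top of each other (ISP, office router, your own computer) and you can see how visitors in different places keep landing on the old server for hours after the change.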
Now I get to my original point: I didn’t want to have to switch hosts because that would mean an additional ~2-18 hours of downtime, depending upon where the website visitor was located! Nooooooooooooooooooo!
But I had no choice.
Part 5: The Switch
After a bit of frantic evaluation, I chose Liquid Web, for a variety of reasons: they own and operate their own datacenters, their VPS offers self-managed image snapshot backups, and they provide phone access to their techs, 24x7x365 – none of which Siteground provides (although they claim to offer 24x7x365 phone support, I found out that wasn’t actually the case). They were also able to pull most of the data off the old VPS, including a lot of settings that would have taken me hours to re-implement. They then restored my site backup, and got us back online. It still took all day today (April 3rd) for the DNS to propagate, but we now have a more ‘hardened’ (tech term for “resistant” – both against downtime and against outside attacks) server staffed by more competent techs. Yay!
It turned out the switch was the right thing to do in terms of time and productivity. I didn’t hear back from Siteground until well after the server migration was complete, and even then, they were unable to restore the server to anything later than 3 days prior to the crash – so much for daily backup service. If it weren’t for our new host and my personal backups, we would have experienced two days of downtime and lost three days’ worth of website data. This equates to somewhere in the ballpark of 80% uptime, which would have been considered “grossly unacceptable” ten years ago, much less today.
Part 6: To be Determined
So now, my web dev and I get to set out to do what we were actually planning to do in the first place – update the website. Wish us luck!