Coming Out of an Extended Outage
In the more than 15 years this AI3 blog has existed, I have never had an outage of more than an hour or so. Today we come back online after an unprecedented 11-day outage. Woohoo! It has been a frustrating period.
The problem first arose after standard maintenance. We use Amazon Web Services' EC2 instances running Ubuntu Linux in the cloud. We had backed up our sites, taken them offline, and were doing what we thought was a routine upgrade from Ubuntu 18.04 LTS (Bionic Beaver) to 20.04 LTS (Focal Fossa). We had some hiccups getting the system functioning and restarted, but finally did so successfully. After upgrading the server, we waited 24 hours, during which all of our Web sites ran fine. We then proceeded to do local upgrades to WordPress and some of its plugins. That is when all hell broke loose.
Upon restarting the server, we lost all SSH communications to the backend. The AWS status checks indicated that the underlying AWS system was fine, but that the instance itself was not passing its checks. When something like this happens, one really begins to scramble.
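(As an aside for others who manage their own instances: those two kinds of status checks, the system check on the AWS host and the instance check on the guest OS, can be queried programmatically as well as from the console. Here is a minimal sketch using the boto3 Python library; the region and instance ID are placeholders, not our actual values.)

```python
# Minimal sketch: query the two EC2 status checks with boto3.
# The region and instance ID below are placeholders; substitute your own.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

resp = ec2.describe_instance_status(
    InstanceIds=["i-0123456789abcdef0"],
    IncludeAllInstances=True,  # also report instances that are not in the 'running' state
)

for status in resp["InstanceStatuses"]:
    print("System status (AWS host):  ", status["SystemStatus"]["Status"])
    print("Instance status (guest OS):", status["InstanceStatus"]["Status"])
```

In our case, the system side came back fine while the instance check kept failing, which pointed to a problem inside the instance rather than with AWS itself.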
We took all of the steps we thought were required to get the instance back up and running. We restored AMIs, snapshots, and volumes, and created new instances from various combinations of them. Nothing seemed to work.
After days of fighting the fight on our own, we bought support service from AWS and began working with their support staff. Though there were some time delays (overseas support, I assume), we got clear and detailed suggestions for what to try. Naturally, due to customer protections, AWS support is not able to manipulate instances directly, but they gave us the instructions to do so on our own.
It appears that a kernel or virtualization mismatch may have crept in somewhere. However, after a couple of tries, we did get concise instructions for creating and attaching new volumes (drives) to our instance that resolved the problem. After years of managing the instance on our own, I was pleased with the degree of response we got from AWS support staff. We can also buy support for a single month and then turn it off again. That is our current plan.
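For readers curious what that fix looks like, here is a minimal sketch of the general recipe using the boto3 Python library. This is not the exact procedure AWS support gave us, and the instance ID, snapshot ID, availability zone, and device name are all placeholders that will differ for your setup (root device names, in particular, vary by AMI).

```python
# Minimal sketch (placeholder IDs): restore a volume from a known-good snapshot
# and swap it in as the root device of a stopped instance.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

INSTANCE_ID = "i-0123456789abcdef0"     # placeholder
SNAPSHOT_ID = "snap-0123456789abcdef0"  # placeholder: last good backup
AZ = "us-east-1a"                       # must match the instance's availability zone
ROOT_DEVICE = "/dev/sda1"               # varies by AMI (e.g., /dev/xvda)

# 1. Create a new volume from the known-good snapshot.
vol = ec2.create_volume(SnapshotId=SNAPSHOT_ID, AvailabilityZone=AZ, VolumeType="gp2")
new_vol_id = vol["VolumeId"]
ec2.get_waiter("volume_available").wait(VolumeIds=[new_vol_id])

# 2. Stop the instance so its root volume can be swapped out.
ec2.stop_instances(InstanceIds=[INSTANCE_ID])
ec2.get_waiter("instance_stopped").wait(InstanceIds=[INSTANCE_ID])

# 3. Detach the old (unbootable) root volume.
desc = ec2.describe_instances(InstanceIds=[INSTANCE_ID])
mappings = desc["Reservations"][0]["Instances"][0]["BlockDeviceMappings"]
old_vol_id = next(m["Ebs"]["VolumeId"] for m in mappings if m["DeviceName"] == ROOT_DEVICE)
ec2.detach_volume(VolumeId=old_vol_id)
ec2.get_waiter("volume_available").wait(VolumeIds=[old_vol_id])

# 4. Attach the restored volume in its place and restart the instance.
ec2.attach_volume(VolumeId=new_vol_id, InstanceId=INSTANCE_ID, Device=ROOT_DEVICE)
ec2.start_instances(InstanceIds=[INSTANCE_ID])
```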
Had we known then what we know now, or perhaps had we been a larger organization with more experience in remote server management, this outage could have been resolved in a much shorter time. (We also were not devoting full time to it.) The solution of how to properly move backups to restored volumes attached to new instances is a pretty set recipe, but one we had not baked before.
So, now we are back running. Our snapshot restored us to a point prior to all of the upgrades, so that task is again in front of us. This time, however, we will take greater care and back up after each baby step as we move forward.
This glitch will cause us to go through our existing infrastructure with a fine-tooth comb. That effort, plus the holidays, means I will be suspending the completion of my Cooking with Python and KBpedia series until after the beginning of the new year.
Sorry for the snafu, and thanks to all of you who contacted us to let us know our sites were offline. My apologies for the extended outage. And Happy Holidays to all!