It's the scenario you always dread: an outage you could have avoided, except the work that would have prevented it was never enough of a priority to get done.
I've been meaning to restructure my LAN for a while because it has a single point of failure. Two days ago, my worst nightmare came true.
While I was at a different office, I watched on my monitoring platform as the network at my primary site went down. It's just the WAN connection dropping, I thought to myself. It wasn't.
Here's a rough diagram of what my network looked like before.
I'm sure you can guess what had happened. The "core" switch had died. (The only thing differentiating it from the other switches is that it's a single point of failure; it's the same model of hardware as everything else.) Most likely the power supply had burnt out. All I know is that it turned off and has never turned back on.
Here's an 'after' shot.
Luckily I had a spare switch lying around. I got a colleague to plug it in, remoted into his laptop, and configured it from there. Overall, the outage lasted a couple of hours.
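For the curious, the remote fix amounted to pushing a basic config onto the blank spare. Here's a rough sketch of the sort of thing involved, not my actual config: the management IP, credentials, VLAN numbers, and Cisco IOS-style device are all hypothetical, and I'm using the netmiko library purely for illustration.

```python
# Hypothetical sketch: push a minimal replacement config to a spare switch.
# Assumes a Cisco IOS-style device; the IP, credentials, and VLANs are made up.
from netmiko import ConnectHandler

switch = ConnectHandler(
    device_type="cisco_ios",
    host="192.0.2.10",        # hypothetical management IP of the spare
    username="admin",
    password="hunter2",       # use proper secrets management in practice
)

# Recreate the VLANs and trunk ports the dead "core" switch carried.
switch.send_config_set([
    "vlan 10",
    "name users",
    "vlan 20",
    "name servers",
    "interface range GigabitEthernet1/0/1-24",
    "switchport mode trunk",
])

switch.save_config()   # write the running config to startup
switch.disconnect()
```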
Moral of the story:
Anything That Can Possibly Go Wrong, Does [1]
If you put things off for long enough, they will come back to bite you in the arse. However tempting it is to work on new stuff, you've got to keep your house in order.
I might do a follow up post about fixing this at some point.
[1] Epigraph of The Butcher: The Ascent of Yerupaja (1952) by John Sack ↩︎