Flashback Friday. This post on failure is from October 23, 2006. I really thought about using this during yesterday’s #citrt “biggest fail” contest.
Last week, we had all sorts of “interesting things” happen. Why should this week start out any differently? Monday, 2 AM, got the alert — Perimeter site is down. Not only have we lost our T1 connection, our back-door DSL connection is out too. Wow…that’s a lot of outage and not many things in common. OK, it could be a cable cut, but at 2 in the morning? Well…it could be a major power failure, but we have a backup generator, so that doesn’t seem likely. What could it possibly be?
Now my mind is racing. Fire? Explosion? Fire in the server room? Fire anywhere else in the building that might have set off the sprinklers, which would be almost as bad as a fire? I decided to wait until 6 to go in, but I didn’t sleep much. Don’t know why I waited.
6:00 AM, arrive at the church. Well, the building is still there, no fire trucks, and from all outside indications all is fine. Hmmm… Walk into the building, still fine. Into the server room, first glance, all is fine. So why aren’t things working? Then I notice it. Some, but not all, of the servers are off. The firewall is down. The DSL firewall is down. What could do such a thing?
The answer is simple. We had a major power failure combined with a failure of our backup generator. Now, back to the title of this post. The backup generator is tested EVERY week! How could it choose to fail at this very moment? Actually, it didn’t. It failed weeks ago! The company that formerly did the maintenance and testing (formerly, as in up until today) did run the test every week. But…they didn’t ever check the results, so they had no record of it failing every week!
Lesson learned. Actually, this isn’t my problem: our facilities group is responsible for the generator, so for the IT department the failure is only a hassle. But what an example for all of us. Testing without monitoring, monitoring without alerting, or alerting without an action plan all end up being the same as not having anything in place at all.
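To make that chain concrete, here is a rough sketch in Python of what "test, check the result, alert, and say what to do next" might look like. None of this is what we or the maintenance company actually ran. The names run_and_follow_up, CheckResult, and generator_start_test are made up for illustration, and the "alert" is just a list standing in for a pager or email.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CheckResult:
    name: str
    passed: bool
    detail: str = ""


def run_and_follow_up(
    name: str,
    test: Callable[[], bool],
    alert: Callable[[str], None],
    runbook: str,
) -> CheckResult:
    """Run a test, record the outcome, and alert with a next step on failure."""
    try:
        passed = bool(test())
        detail = "" if passed else "test reported failure"
    except Exception as exc:
        # A test that crashes counts as a failed test, not a skipped one.
        passed = False
        detail = f"test raised {exc!r}"

    result = CheckResult(name, passed, detail)
    if not result.passed:
        # The alert carries the action plan, so whoever gets it knows what to do.
        alert(f"{name} FAILED ({detail}). Next step: {runbook}")
    return result


# Hypothetical stand-in for the weekly generator start test.
def generator_start_test() -> bool:
    return False  # simulate a generator that will not crank


sent = []  # stand-in for a real alert channel (pager, email, ...)
result = run_and_follow_up(
    "weekly generator test",
    generator_start_test,
    sent.append,
    "call the facilities on-call and check the starter battery",
)
```

The point of the sketch is that the test itself is the small part. The result gets looked at every time, a failure reaches a person, and the message tells that person what to do.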
As I was lying awake all morning, the number of scenarios for failure going through my head was staggering. Having a plan to deal with them is also a concern. How can this be done? I think it’s through simple things. After the fact, we learned that quite a few people had known of the generator failure weeks ago, but didn’t think to tell anyone. How do we teach people to recognize that something is wrong, and take some action? It’s more than having a good disaster plan. It takes reviewing it with new staff, and even re-reviewing it regularly with staff who have been around.
I now remember the last time we had a problem like this, about 7 years ago. We messed up that time, too. Sigh.
Update: We found the problem. The battery that cranks the generator wasn’t charged, or had gone bad, or otherwise failed. Grabbed another battery and we were fine again. Of course, this was discovered a bit after power failed the SECOND time, just moments after we’d successfully powered everything back up. We got to test our procedures twice in 12 hours. I didn’t like this test. But I learned from it!